Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Building Lean 4.15-rc1 with default build settings runs into a Stack Overflow in Data/UInt/Lemmas.lean #6434

Open
1 task done
soulsource opened this issue Dec 21, 2024 · 18 comments
Labels
bug Something isn't working P-low We are not planning to work on this issue

Comments

@soulsource
Copy link

soulsource commented Dec 21, 2024

Prerequisites

Please put an X between the brackets as you perform the following steps:

Description

Building Lean 4.15-rc1 from source with the default build settings fails with a stack overflow when compiling Data/UInt/Lemmas.lean for stage1.

Context

I am not sure if this is specific to my system, or if others might be running into the same issue. Also, not sure why it doesn't happen in CI. Compiling Lean 4.14 with identical settings was working fine on the same machine.

Looking at the call-stack in the crashdump, there are over 3000 recursions of l_Lean_IR_NormalizeIds_normFnBody. However, it is not an infinite recursion, as increasing the thread stack size via LEAN_EXTRA_MAKE_OPTS allows to work around this build issue.
I haven't tested what stack size exactly is needed, but passing -DLEAN_EXTRA_MAKE_OPTS="-s 262144" to cmake allowed the build to finish.

Depending on the question if anyone else sees this, it might be a good idea to set a higher default stack size while building Lean.

Steps to Reproduce

  1. Compile Lean 4.15-rc1 from source with default settings on AMD64 Linux

Expected behavior: The build succeeds

Actual behavior: The build fails with a stack overflow when compiling Data/UInt/Lemmas.lean

Versions

Gentoo Linux, kernel 6.6.62, gcc (Gentoo 14.2.1_p20241116 p3) 14.2.1 20241116

Additional Information

I've used an ebuild to build lean4. Not sure if this is relevant, but for the sake of completeness:
Edit: This is the ebuild that already passes -DLEAN_EXTRA_MAKE_OPTS="-s 262144". To repro the issue with it, please remove the respective line.


MAJOR="$(ver_cut 1)"

CMAKE_MAKEFILE_GENERATOR="emake"
PYTHON_COMPAT=( python3_{10..12} )

inherit cmake flag-o-matic python-any-r1

DESCRIPTION="The Lean Theorem Prover"
HOMEPAGE="https://leanprover-community.github.io/"

if [[ "${PV}" == *9999* ]] ; then
	inherit git-r3

	EGIT_REPO_URI="https://github.com/leanprover/${PN}${MAJOR}.git"
else
	SRC_URI="https://github.com/leanprover/${PN}${MAJOR}/archive/refs/tags/v${PV/_/-}.tar.gz
		-> ${P}.tar.gz"
	S="${WORKDIR}/${PN}${MAJOR}-${PV/_/-}"

	KEYWORDS="~amd64 ~x86"
fi

LICENSE="Apache-2.0"
SLOT="0/${MAJOR}"
IUSE="debug source"

RDEPEND="
	dev-libs/gmp:=
	sci-mathematics/cadical
"
DEPEND="
	${RDEPEND}
"
BDEPEND="
	${PYTHON_DEPS}
"

pkg_setup() {
	python-any-r1_pkg_setup
}

src_prepare() {
	filter-lto

	sed -e "s|-O[23]|${CFLAGS}|g" -i src/CMakeLists.txt || die

	cmake_src_prepare
}

src_configure() {
	local CMAKE_BUILD_TYPE

	if use debug ; then
		CMAKE_BUILD_TYPE="Debug"
	else
		CMAKE_BUILD_TYPE="Release"
	fi

	local -a mycmakeargs=(
		-DLEAN_EXTRA_CXX_FLAGS="${CXXFLAGS}"
		-DLEAN_EXTRA_LINKER_FLAGS="${LDFLAGS}"
		-DLEAN_EXTRA_MAKE_OPTS="-s 262144"
		-DINSTALL_CADICAL=OFF
	)
	cmake_src_configure
}

src_install() {
	cmake_src_install

	rm "${ED}/usr/LICENSE"* || die

	if ! use source ; then
		rm -r "${ED}/usr/src" || die
	fi
}

Impact

Add 👍 to issues you consider important. If others are impacted by this issue, please ask them to add 👍 to it.

@soulsource soulsource added the bug Something isn't working label Dec 21, 2024
@soulsource soulsource changed the title Building Lean-4.15-rc1 with default build settings runs into a Stack Overflow in Data/UInt/Lemmas.lean Building Lean 4.15-rc1 with default build settings runs into a Stack Overflow in Data/UInt/Lemmas.lean Dec 21, 2024
@xuzhao9
Copy link

xuzhao9 commented Jan 5, 2025

I can reproduce this problem in openSUSE's packaging system: https://build.opensuse.org/package/show/home:nuklly/lean4

After using the suggested build commands, I can fix the stack overflow error: https://build.opensuse.org/package/show/home:nuklly/lean4. I think -O3 will cause this problem and -O2 will fix it.

@hargoniX
Copy link
Contributor

hargoniX commented Jan 5, 2025

Hi, first a general note, as we've explained in other issues such as #3068, packaging Lean itself is probably not a good idea right now. People require very specific Lean versions for their packages and unless those are present they are going to run into very confusing error messages as we currently have basically no guarantees for backwards compatibility.

Now regarding these errors, I (and to my knowledge all other people on the development team) do not work on SUSE or Gentoo and we can compile the project locally just fine on our machines. Those include at least Fedora, NixOS, MacOS and Windows. Furthermore our CI can also build Lean just fine on Ubuntu and some other platforms in various configurations. So there is likely something different about these distros than the one we currently work on mostly.

As a reference my local setup is currently gcc (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3) together with ulimit -s at 8192 and I compile just as instructed here https://lean-lang.org/lean4/doc/make/index.html#generic-build-instructions. What's your ulimit -s? And can you at least build locally following the instructions there?

@soulsource
Copy link
Author

I'm running gcc (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221 and my ulimit -s is at 8MiB too.

The stack overflow I encountered wasn't related to ulimit, but rather lean's own thread stack size (the value overridable by the -s parameter). I am unsure, but looking at the crash dump this value already seems to be larger than 8MiB with the default build config - the call stack was about 128 MiB in my core dump.

I'll do a compile exactly as given on the generic-build-instructions later (probably tomorrow) to see if it yields different results.

@asdasd1dsadsa
Copy link
Contributor

Also

ln -f ../../lean4-9999_build/stage1/lib/temp/Init/Data/Vector/Lemmas.o.export ../../lean4-9999_build/stage1/lib/temp/Init/Data/Vector/Lemmas.o

Stack overflow detected. Aborting.

on Gentoo (gcc-13, glibc-2.39-r6), while building the latest commit with another ebuild. (Using system gmp and cadical).

Didn't try much. I can provide emerge --info if needed.

@soulsource
Copy link
Author

soulsource commented Jan 6, 2025

I've meanwhile tried to follow the linked build instructions for 4.15 (which are basically just "run cmake with release config and make"), and building it with my normal user. This build succeeded without issues.
(see attached script record - though it isn't really interesting, given that it worked)

I just realized that src/CMakeLists.txt adds -O3 to CMAKE_CXX_FLAGS_RELEASE, but the ebuild I am using (which is more or less a copy of the ebuild in Gentoo's portage tree) replaces this by my own system-wide CFLAGS, which are -O2 -pipe -march=znver3 -mtune=znver3.

Can it be that the overflow is tiny, and that changing the optimization flags makes enough of a difference that -O3 runs through, but -O2 does not?

Edit: Just tried -O2 with 4.15 now (script record), and it worked too... I'm now confused as to what could be the difference when using portage to build?

Edit 2: Building through portage is still failing:
lean-ebuild-typescript.gz

@asdasd1dsadsa
Copy link
Contributor

asdasd1dsadsa commented Jan 6, 2025

Edit 2: Building through portage is still failing: lean-ebuild-typescript.gz

Changing -O2 to -O3 also lead to the same error for me.

Complete build.log produced by portage:
build.log

Edit: Same failure if I commented out mycmakeargs in my ebuild file.

@soulsource
Copy link
Author

I'm also no longer sure if -DLEAN_EXTRA_MAKE_OPTS="-s 262144" actually has any effect. I was able to build 4.15-rc1 once with this set, but trying to do the same now, with the released 4.15 version, runs into a stack overflow again.

At this point I am actually not even sure it is a stack overflow.

Since I did manage to repro it, I can now also share the backtrace, in case it helps. The top of the call stack looks like this:

#0  0x00007f79d975677c in ?? () from /usr/lib64/libc.so.6
#1  0x00007f79d97062d6 in raise () from /usr/lib64/libc.so.6
#2  0x00007f79d96ee8f7 in abort () from /usr/lib64/libc.so.6
#3  0x00007f79df23335a in segv_handler () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#4  <signal handler called>
#5  0x00007f79da573005 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#6  0x00007f79da5731f0 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#7  0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#8  0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#9  0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#10 0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#11 0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#12 0x00007f79da5731f0 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#13 0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#14 0x00007f79da5731f0 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#15 0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#16 0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#17 0x00007f79da573742 in l_Lean_RBNode_ins___at_Lean_IR_addVarRename___spec__2 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#18 0x00007f79da57c6e2 in l_Lean_RBNode_insert___at_Lean_IR_addVarRename___spec__1 () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#19 0x00007f79da741301 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#20 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#21 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#22 0x00007f79da741328 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#23 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#24 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#25 0x00007f79da741328 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#26 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#27 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#28 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#29 0x00007f79da741328 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#30 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#31 0x00007f79da741328 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#32 0x00007f79da741328 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#33 0x00007f79da741328 in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#34 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#35 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so
#36 0x00007f79da743ace in l_Lean_IR_NormalizeIds_normFnBody () from /tmp/portage/sci-mathematics/lean-4.15.0/work/lean4-4.15.0_build/stage0/bin/../lib/lean/libleanshared.so

Here is the full call stack, in case it's relevant:
gdb.txt

@soulsource
Copy link
Author

Hmmm... Sorry, I'm really out of the water here, as I do not usually have to deal with stack size limitations....

A closer look at the $sp value shows that it's likely running into ulimit stack size problems, and that I just misread the core dump last time.

The frame below the segfault handler has about 8 MiB offset compared to the main() function's stack frame, soooooo, I'm back at trying to convince portage to increase the default stack size.

@soulsource
Copy link
Author

soulsource commented Jan 6, 2025

In any case, this ebuild allowed me to compile Lean 4.15 via Portage right now. I only tried it once, but will try to use the same ebuild tomorrow on my laptop, where I have been seeing the same issue.

Compared to the ebuild in the original issue report, this now also sets ulimit -s to a higher value than the default 8 MiB.

EAPI=8

MAJOR="$(ver_cut 1)"

CMAKE_MAKEFILE_GENERATOR="emake"
PYTHON_COMPAT=( python3_{10..12} )

inherit cmake flag-o-matic python-any-r1

DESCRIPTION="The Lean Theorem Prover"
HOMEPAGE="https://leanprover-community.github.io/"

if [[ "${PV}" == *9999* ]] ; then
	inherit git-r3

	EGIT_REPO_URI="https://github.com/leanprover/${PN}${MAJOR}.git"
else
	SRC_URI="https://github.com/leanprover/${PN}${MAJOR}/archive/refs/tags/v${PV/_/-}.tar.gz
		-> ${P}.tar.gz"
	S="${WORKDIR}/${PN}${MAJOR}-${PV/_/-}"

	KEYWORDS="~amd64 ~x86"
fi

LICENSE="Apache-2.0"
SLOT="0/${MAJOR}"
IUSE="debug source"

RDEPEND="
	dev-libs/gmp:=
	sci-mathematics/cadical
"
DEPEND="
	${RDEPEND}
"
BDEPEND="
	${PYTHON_DEPS}
"

pkg_setup() {
	python-any-r1_pkg_setup
}

src_prepare() {
	filter-lto

	sed -e "s|-O[23]|${CFLAGS}|g" -i src/CMakeLists.txt || die

	cmake_src_prepare
}

src_configure() {
	local CMAKE_BUILD_TYPE

	if use debug ; then
		CMAKE_BUILD_TYPE="Debug"
	else
		CMAKE_BUILD_TYPE="Release"
	fi

	local -a mycmakeargs=(
		-DLEAN_EXTRA_CXX_FLAGS="${CXXFLAGS}"
		-DLEAN_EXTRA_LINKER_FLAGS="${LDFLAGS}"
		-DLEAN_EXTRA_MAKE_OPTS="-s 262144"
		-DINSTALL_CADICAL=OFF
	)
	cmake_src_configure
}

src_compile() {
	ulimit -s 30000000
	cmake_src_compile
}

src_install() {
	cmake_src_install

	rm "${ED}/usr/LICENSE"* || die

	if ! use source ; then
		rm -r "${ED}/usr/src" || die
	fi
}

Can it be that it depends on timing on which thread the crashing function runs? Like, that it can either be the main thread (which has a stack size set by ulimit -s) or a worker thread (which has a stack size set by -DLEAN_EXTRA_MAKE_OPTS="-s 262144"), depending on scheduling? Because that would explain why it worked once for me with just -DLEAN_EXTRA_MAKE_OPTS="-s 262144" set, but crashed several times now, until I also set ulimit -s?

Edit: Tried this ebuild on my laptop now too, and it worked first try. Soo, I think a combination of setting a higher ulimit -s and passing -DLEAN_EXTRA_MAKE_OPTS="-s 262144" indeed does the trick.

@asdasd1dsadsa
Copy link
Contributor

sed -e "s|-O[23]|${CFLAGS}|g" -i src/CMakeLists.txt || die
-DLEAN_EXTRA_MAKE_OPTS="-s 262144"
ulimit -s 30000000

Adding these to my ebuild worked for me.

@Kha
Copy link
Member

Kha commented Jan 10, 2025

@soulsource Do you also have a backtrack of a side-thread stack overflow? The one you posted will always be on main so so far I'm thinking this is strictly a ulimit -s issue which is outside our control

@soulsource
Copy link
Author

soulsource commented Jan 10, 2025 via email

@leanprover-bot leanprover-bot added the P-low We are not planning to work on this issue label Jan 10, 2025
@soulsource
Copy link
Author

Hmmm, I now commented out the -DLEAN_EXTRA_MAKE_OPTS="-s 262144" line in the ebuild and the build crashed with stack overflow, however it crashed in Thread 1 (which I assume to be the main thread) at a stack size of 8 MiB, what I find weird, given that the ebuild does still set ulimit -s 30000000.
I've retried it 3 times, and every time I ran into a stack overflow in Thread 1.

However, once I added back the -DLEAN_EXTRA_MAKE_OPTS="-s 262144" parameter, building worked reliably (as in: 2/2 tries) again...

So, what I can say is that it seems to always crash in the main thread, and that it still seems to require both, setting a higher ulimit -s and requesting a higher thread-stacksize from lean via the -s parameter....

@Kha
Copy link
Member

Kha commented Jan 10, 2025 via email

@soulsource
Copy link
Author

soulsource commented Jan 10, 2025 via email

@soulsource
Copy link
Author

Back at my PC - here's the full call stack for the crash with a higher ulimit -s, but without passing the -DLEAN_EXTRA_MAKE_OPTS="-s 262144" option.
gdb.txt

Just to confirm what the build environment is, here the build log (script), at the beginning of which you can also find a cat of the ebuild:
buildlog.typescript.gz

And here is the core file, and (hopefully) the corresponding binaries:
dump-part1.tar.gz
dump-part2.tar.xz.gz - sorry for putting an xz file in a gzip file, but github wouldn't let me upload the tar.xz file, and just using gzip would yield a too large file...

@Kha
Copy link
Member

Kha commented Jan 14, 2025

Okay, so there are two different call stacks that can overflow the main thread and a side thread, respectively, because your compiler creates stack frames that are larger than any in our CI for some reason. I'm not sure there is anything for us to accommodate on our side here except to document that various systems may require various build adjustments.

@soulsource
Copy link
Author

Yeah, I'm afraid that is the best way to handle this.

I had quickly thought about changing normFnBody to be tail-recursive, but that function isn't trivial enough to make this easy, and I don't think an easy-to-workaround compile issue warrants investing the time to change it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P-low We are not planning to work on this issue
Projects
None yet
Development

No branches or pull requests

6 participants