clBuildProgram segv #71
I am asking the team to look into this. G |
Where can I get the exact kernel source, dependencies, and compiler options needed to reproduce the problem? |
The source code is here: Warning: the review site serves a tar.bz2 which is a tarbomb (no root directory). Extract, and from the build directory run:
The |
As the code in question is about to pass code review and be merged, which will prevent me from testing with ROCm, I'd be thankful if you could suggest an easy workaround that I can use until the compiler issue is fixed. |
I used ccmake to point precisely at the OpenCL_INCLUDE_DIR and OpenCL_LIBRARY that I want to use, and it tells me " OpenCL is not supported. OpenCL version 1.2 or newer is required." Is it true that the INCLUDE_DIR I point to should contain a directory named CL containing cl.h... and that LIBRARY should be a file named libOpenCL.so? If so, what else does it want? |
That should be enough. It seems that to get rid of the "sticky" error you need to start over with a clean cache (pass -DOpenCL_INCLUDE_DIR and -DOpenCL_LIBRARY to cmake). |
Thanks. I am able to build and run. What kind of device are you seeing the problem on? I am running using a debug build of the tip compiler on gfx803 and it looks like it's going to pass. What version of ROCm are you running? |
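As a rough sketch of the advice above: the helper below checks that the include directory contains CL/cl.h and that the library file exists, then prints a cmake command line that passes both paths explicitly. The function names and any paths you feed it are hypothetical; the GMX_* options are the ones used elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of GROMACS or ROCm): sanity-check the OpenCL
# paths before handing them to cmake.
check_opencl_paths() {
    inc_dir=$1
    lib_file=$2
    # The include directory is expected to contain CL/cl.h.
    if [ ! -f "$inc_dir/CL/cl.h" ]; then
        echo "missing $inc_dir/CL/cl.h" >&2
        return 1
    fi
    # The library is expected to be a file such as libOpenCL.so.
    if [ ! -f "$lib_file" ]; then
        echo "missing $lib_file" >&2
        return 1
    fi
    return 0
}

# Print (do not run) the configure command with both paths set explicitly.
print_configure_cmd() {
    src_dir=$1
    inc_dir=$2
    lib_file=$3
    check_opencl_paths "$inc_dir" "$lib_file" || return 1
    echo "cmake $src_dir -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DGMX_BUILD_OWN_FFTW=ON -DOpenCL_INCLUDE_DIR=$inc_dir -DOpenCL_LIBRARY=$lib_file"
}
```

Running the printed command from a freshly created, empty build directory avoids reusing a stale CMakeCache.txt, which is presumably where the "sticky" error comes from.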
... NOTE 1 [file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp, line 13]: Setting the LD random seed to 965332988 NOTE 2 [file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp]: This run will generate roughly 0 Mb of data There were 2 notes Using 1 MPI thread 1 GPU auto-selected for this run. NOTE: Thread affinity was not set. Writing final coordinates.
Performance: 3.786 6.339 [----------] Global test environment tear-down YOU HAVE 45 DISABLED TESTS |
gfx803 and gfx900.
OK, but I'm not sure what that tells us.
|
ROCm/ROCm#404 (comment) says 1.9 will be releasing very soon. Since the problem is not showing up with the tip compiler, your issue was fixed sometime after 1.8 released. Hopefully it was picked up in 1.9. |
OK, looking forward to seeing the 1.9 not crash, but admittedly I'd be more relieved if somebody confirmed that the release branch is in fact fixed. (Unrelated, but I'm hoping that 1.8 debs won't get pulled so I can down- and upgrade freely.) |
Though if 1.9 is indeed dropping today, it won't be a long wait. |
After updating the toolchain to ROCm 1.9, I am still getting a clBuildProgram() segfault, so unfortunately this seems to have fallen through the cracks. How long until the next patch release? |
FWIW, I don't have a spare machine where I can fully install 1.9, but I pointed my LD_LIBRARY_PATH at a release build of the 1.9 OpenCL, HSA, and thunk shared objects and mdrun-test passed for me on gfx803. It says "1 GPU auto-selected for this run." so I assume it is running on the GPU. |
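When relying on LD_LIBRARY_PATH like this, it can help to confirm which libOpenCL the binary actually resolves. A minimal sketch, assuming the usual `name => path (address)` layout of `ldd` output; the function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print the resolved path
# of the first-column library whose name starts with the given prefix.
resolved_lib() {
    awk -v lib="$1" 'index($1, lib) == 1 && $2 == "=>" { print $3 }'
}
```

Usage would look like `ldd ./mdrun-test | resolved_lib libOpenCL.so`. Note that `ldd` only reports link-time dependencies; a vendor library that the ICD loader opens at run time, such as libamdocl64.so, would not appear in that list.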
I still see it fail on my ROCm 1.9 system with a Vega (gfx900) and Fiji (gfx803) installed. OpenCL driver version 2679.0, so I believe this is a full 1.9 install. Currently unable to build a debug release of the OpenCL runtime to get symbols, or I'd point where the issue is coming up for me. |
Do you know if it is trying to build programs for both devices? Maybe the build is faulting when trying to build for vega? |
Just putting my commands down here so I don't need to arrow-up every time I want to run this test. :)
mkdir -p ~/gromacs_test/
cd ~/gromacs_test/
wget https://gerrit.gromacs.org/changes/7810/revisions/f199e29cc958c00bd1481e710d9abdd0d36ae0f9/archive?format=tbz2
mv archive\?format\=tbz2 gromacs.tar.bz2
tar -xf gromacs.tar.bz2
SOURCE_DIR=$(pwd)
mkdir build
cd build
cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DGMX_BUILD_OWN_FFTW=ON
make -j `nproc` mdrun-test
cd bin
./mdrun-test
Re-tested on ROCm 1.9 on a system with only Polaris 10 (gfx803):
gdb backtrace (no symbols in libamdocl64.so at the moment)
|
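Since the archive was flagged earlier in the thread as a tarbomb, here is a small sketch for checking an archive's top-level entries and extracting it into a dedicated directory. Both function names are hypothetical; only standard `tar`, `sed`, `cut`, and `sort` behavior is assumed.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct top-level entries of an archive.
# More than one entry means there is no single root directory.
top_level_entries() {
    tar -tf "$1" | sed 's|^\./||' | sed '/^$/d' | cut -d/ -f1 | sort -u
}

# Hypothetical helper: extract into a directory of its own so the contents
# cannot spill into the current working directory.
extract_contained() {
    archive=$1
    dest=$2
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}
```

The command list above already sidesteps the problem by extracting inside ~/gromacs_test/; the check is only useful when the layout of an archive is unknown.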
Sigh. I really do not like rpath. My build of libgromacs.so.4.0.0 has an rpath pointing to the directory containing my internal opencl bits. I see the segv now after hitting it with "chrpath -d". |
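For anyone hitting the same confusion, a sketch of how one might spot an embedded rpath before trusting LD_LIBRARY_PATH. It assumes the usual `readelf -d` layout, where the entry is tagged `(RPATH)` or `(RUNPATH)` and the path sits in square brackets; the function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `readelf -d` output on stdin and print the path
# list of any RPATH or RUNPATH entry.
extract_rpath() {
    grep -E '\((RPATH|RUNPATH)\)' | sed 's/.*\[\(.*\)\].*/\1/'
}
```

Usage would be along the lines of `readelf -d libgromacs.so.4.0.0 | extract_rpath`. An RPATH entry is searched before LD_LIBRARY_PATH, which would explain why the override had no effect until the entry was removed with `chrpath -d`.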
So the good news and bad news for @pszi1ard:
I don't think we will be able to give you a solid timeline for when this will make it into an external release, as we are still working out which patches may enter into any bugfix point release in 1.9.x. |
Thanks for the feedback @b-sumner and @jlgreathouse. I appreciate that the planning for 1.9.x is ongoing. Is there a (reasonably straightforward!) way for me to build and replace the rocm-opencl packages so I can keep using ROCm? Do you happen to have a suggestion for a non-invasive transformation on the code to tickle the compiler and avoid the segv? Otherwise, the side effect of putting aside any testing/dev with ROCm right now is that no GROMACS development and testing can be done on ROCm for the upcoming 2019 release (change freeze in about a month) until fixes land. This poses the risk that our next release will not work at all on ROCm, so that at worst we'll have to keep warning users against using ROCm, and at best we'll ship untested / un-tuned code if a fix comes before the final release later this year. Side-note:
but not much more. You can avoid this by passing |
The 1.9 compiler was branched on June 21. The failure being hit is in a "register coalescer" which was subsequently updated after the branch. I'm not really sure how to perturb the code to affect something that deep. One possibility might be to reduce each component separately instead of all 3 at once. |
Thanks for the tip. Unfortunately it did not work. The strange thing is that if I completely remove the 2nd, conditional atomic reduction (also concurrent of three values), the crash is gone. However, if I issue both the in-loop and outside-of-loop atomic ops sequentially, I still get the crash. Any other tips? :) On a different note: Is it a reasonably straightforward thing (and workable idea) to build a rocm-opencl deb package from source and replace the current one? Where would I start -- is the bug fixed in the 1.9 release branch? |
Hi @pszi1ard IMO, it's pretty easy to build a custom install of the OpenCL runtime. See my post in this other issue which includes a shell script that will do basically everything for you. You might have to play around with it to build a .deb package -- to properly cpack the results, for example. You might also want to change the build to Release from ReleaseWithDebInfo. You could pull the pre- and post-install files out of the existing ROCm OpenCL .deb file if you want to make things slightly easier. |
I'll have to defer to @b-sumner about whether this bug is fixed in the 1.9.0 source code release branch of the open source OpenCL runtime. I believe our original intent was that the source code release would match the source we used to build the .deb releases. That said, I see that the roc-1.9.x branch of LLVM that the OpenCL build directions pull from had some patches regarding register allocation brought into it last week. I don't know if this was meant to fix the issue being raised here or not. |
It seems like this missed the 1.9.1 too, right? Any plans to release a new rocm-opencl? |
There likely won't be a major update to rocm-opencl until ROCm 2.0. |
That's very unfortunate. Is there an approximate ETA for ROCm 2.0? (I know it has just been announced, but announcement != release, and we need to advise our users whether to steer clear of ROCm or not.) |
I apologize for the long delay on this. I believe the bugfix is part of a larger series of changes in the compiler. We didn't want to bring new functionality into a 1.9.x point release, but it also would have been difficult to cherry-pick this individual fix back into the 1.9.x code base. I believe that our target is for 2.0 to be out by the end of the year, but I'm not sure if AMD has made a public announcement on an official exact date. |
@jlgreathouse OK, I understand that the cost of backporting a fix is too high to address this issue. End of the year would be great, I hope I can get a confirmation soon. |
We are getting close to our final release and at the moment 1.9 is still not working. I've seen some links to 2.0 beta rpms floating around on the Internet, but I'm not aware of a deb repo. Has your internal testing covered my bug report? Do you have 2.0 beta debs available? With the latest 1.9.x, in addition to the previous 100% reproducible segv, I also see a clBuildProgram segv in ~0.1-0.2% of the compilations, mostly (only?) when building clFFT. Do you know of such issues? |
Hi @pszi1ard Do you have a preferred contact mechanism? I can send you a link to our 2.0 beta repos, but I would prefer not to post it publicly. |
Email. [email protected] |
PS: thanks. I'll also try to coordinate with the internal team that does GROMACS validation (although they seem to have focused on the previous release which works fine) |
The following change, which only refactors the GROMACS OpenCL kernels, causes the OpenCL compiler to crash:
https://gerrit.gromacs.org/#/c/7810/19/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel_utils.clh
The culprit has been isolated to the linked changes on lines 675-677: the local memory stores that have been moved from the caller into the reduction function in question. If these three lines are commented out, the compilation succeeds.