
clBuildProgram segv #71

Open
pszi1ard opened this issue Sep 13, 2018 · 33 comments

@pszi1ard

The following change, which only refactors the GROMACS OpenCL kernels, causes the OpenCL compiler to crash:
https://gerrit.gromacs.org/#/c/7810/19/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel_utils.clh

The culprit has been isolated to the linked changes on lines 675-677: the local memory stores that were moved from the collar into the reduction function in question. If these three lines are commented out, the compilation succeeds.

@gstoner
Contributor

gstoner commented Sep 13, 2018

I am asking the team to look into this.

G

@b-sumner
Collaborator

b-sumner commented Sep 13, 2018

Where can I get the exact kernel source, dependencies, and compiler options needed to reproduce the problem?

@pszi1ard
Author

The source code is here:
https://gerrit.gromacs.org/changes/7810/revisions/f199e29cc958c00bd1481e710d9abdd0d36ae0f9/archive?format=tbz2

Warning: the review site serves a tar.bz2 which is a tarbomb (no root directory).
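The safe pattern with a tarbomb is to always extract into a fresh directory with -C. A self-contained demo of that pattern (the archive and file names below are placeholders; the real Gerrit archive is bzip2-compressed, so use -xjf there):

```shell
# Placeholder demo of handling a "tarbomb" (archive with no root directory):
# extract with -C into a fresh directory so files cannot scatter into $PWD.
mkdir -p demo_src
echo "hello" > demo_src/file.txt
tar -czf archive.tar.gz -C demo_src .     # build a root-less demo archive
mkdir -p gromacs_src
tar -xzf archive.tar.gz -C gromacs_src    # confined extraction
cat gromacs_src/file.txt                  # -> hello
```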

Extract, and from the build directory run:

cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON && \
make mdrun-test && \
bin/mdrun-test

The mdrun-test unit test will segv as soon as it hits the compilation of the kernel(s) in src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel.clh.

@pszi1ard
Author

As the code in question is about to pass review and be merged, which will prevent me from testing with ROCm, I'd be thankful if you could suggest an easy workaround that I can use until the compiler issue is fixed.

@b-sumner
Collaborator

I used ccmake to point precisely at the OpenCL_INCLUDE_DIR and OpenCL_LIBRARY that I want to use, and it tells me " OpenCL is not supported. OpenCL version 1.2 or newer is required."

Is it true that the INCLUDE_DIR I point to should contain a directory named CL containing cl.h... and that LIBRARY should be a file named libOpenCL.so? If so, what else does it want?

@pszi1ard
Author

Is it true that the INCLUDE_DIR I point to should contain a directory named CL containing cl.h... and that LIBRARY should be a file named libOpenCL.so?

That should be enough. It seems that to get rid of the "sticky" error you need to start over with a clean cache (pass -DOpenCL_INCLUDE_DIR and -DOpenCL_LIBRARY to cmake).
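For the record, a clean reconfigure with the OpenCL paths given explicitly might look like this (the /opt/rocm paths are assumptions based on a default ROCm install; adjust to your setup):

```shell
# Wipe the build directory so no stale OpenCL_* cache entries survive,
# then pass the OpenCL locations explicitly. Paths are illustrative.
rm -rf build && mkdir build && cd build
cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON \
      -DOpenCL_INCLUDE_DIR=/opt/rocm/opencl/include \
      -DOpenCL_LIBRARY=/opt/rocm/opencl/lib/x86_64/libOpenCL.so
```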

@b-sumner
Collaborator

Thanks. I am able to build and run.

What kind of device are you seeing the problem on? I am running using a debug build of the tip compiler on gfx803 and it looks like it's going to pass.

What version of ROCm are you running?

@b-sumner
Collaborator

...
[ RUN ] MdrunCanWrite/NptTrajectories.WithDifferentPcoupl/2

NOTE 1 [file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp, line 13]:
/tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp did not specify a value for the .mdp option "cutoff-scheme". Probably it
was first intended for use with GROMACS before 4.6. In 4.6, the Verlet
scheme was introduced, but the group scheme was still the default. The
default is now the Verlet scheme, so you will observe different behaviour.

Setting the LD random seed to 965332988
Generated 279 of the 1225 non-bonded parameter combinations
Excluding 2 bonded neighbours molecule type 'Methanol'
Excluding 2 bonded neighbours molecule type 'SOL'
Removing all charge groups because cutoff-scheme=Verlet
Number of degrees of freedom in T-Coupling group System is 12.00
Determining Verlet buffer for a tolerance of 0.005 kJ/mol/ps at 298 K
Calculated rlist for 1x1 atom pair-list as 1.025 nm, buffer size 0.025 nm
Set rlist, assuming 4x4 atom pair-list, to 1.022 nm, buffer size 0.022 nm
Note that mdrun will redetermine rlist based on the actual pair-list setup

NOTE 2 [file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2_input.mdp]:
You are using a plain Coulomb cut-off, which might produce artifacts.
You might want to consider using PME electrostatics.

This run will generate roughly 0 Mb of data

There were 2 notes
Reading file /tmp/gro/build/src/programs/mdrun/tests/Testing/Temporary/MdrunCanWrite_NptTrajectories_WithDifferentPcoupl_2.tpr, VERSION 2019-dev (single precision)
Changing nstlist from 10 to 100, rlist from 1.022 to 1.373

Using 1 MPI thread
Using 1 OpenMP thread

1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
PP:0

NOTE: Thread affinity was not set.
starting mdrun 'spc-and-methanol'
2 steps, 0.0 ps.

Writing final coordinates.

           Core t (s)   Wall t (s)        (%)
   Time:        0.068        0.068      100.0
             (ns/day)    (hour/ns)

Performance: 3.786 6.339
[ OK ] MdrunCanWrite/NptTrajectories.WithDifferentPcoupl/2 (28402 ms)
[----------] 3 tests from MdrunCanWrite/NptTrajectories (85248 ms total)

[----------] Global test environment tear-down
[==========] 27 tests from 11 test cases ran. (817537 ms total)
[ PASSED ] 27 tests.

YOU HAVE 45 DISABLED TESTS

@pszi1ard
Author

What kind of device are you seeing the problem on?

gfx803 and gfx900.

I am running using a debug build of the tip compiler on gfx803 and it looks like it's going to pass.

OK, but I'm not sure what that tells us.
Wouldn't it still be useful to know whether you can reproduce the issue? Is there a 1.8 patch release planned? Otherwise, what is the ETA for 1.9?

What version of ROCm are you running?

$ dpkg -l | grep "rocm-"
ii  rocm-clang-ocl                         0.3.0-7997136                              amd64        OpenCL compilation with clang compiler.
ii  rocm-opencl                            1.2.0-2018082755                           amd64        OpenCL/ROCm
ii  rocm-opencl-dev                        1.2.0-2018082755                           amd64        OpenCL/ROCm
ii  rocm-smi                               1.0.0-46-g81ef66f                          amd64        System Management Interface for ROCm
ii  rocm-utils                             1.8.199                                    amd64        Radeon Open Compute (ROCm) Runtime software stack

@b-sumner
Collaborator

ROCm/ROCm#404 (comment) says 1.9 will be releasing very soon. Since the problem is not showing up with the tip compiler, your issue was fixed sometime after 1.8 released. Hopefully it was picked up in 1.9.

@pszi1ard
Author

OK, looking forward to seeing 1.9 not crash, but admittedly I'd be more relieved if somebody confirmed that the release branch is in fact fixed.

(Unrelated, but I'm hoping that 1.8 debs won't get pulled so I can down- and upgrade freely.)

@pszi1ard
Author

OK, looking forward to seeing 1.9 not crash, but admittedly I'd be more relieved if somebody confirmed that the release branch is in fact fixed.

Though if 1.9 is indeed dropping today, it won't be a long wait.

@pszi1ard
Author

After updating the toolchain to ROCm 1.9, I am still getting a clBuildProgram() segfault, so unfortunately this seems to have fallen through the cracks.

How long until the next patch release?

@b-sumner
Collaborator

FWIW, I don't have a spare machine where I can fully install 1.9, but I pointed my LD_LIBRARY_PATH at a release build of the 1.9 OpenCL, HSA, and thunk shared objects and mdrun-test passed for me on gfx803. It says "1 GPU auto-selected for this run." so I assume it is running on the GPU.
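For reference, that library-swapping trick looks roughly like the following (paths are illustrative, not the actual install locations). One caveat: a DT_RPATH embedded in the binary takes precedence over LD_LIBRARY_PATH, so it is worth verifying with ldd which copies the loader will actually pick up:

```shell
# Point the dynamic loader at alternative runtime builds before running.
# Paths below are placeholders for wherever the 1.9 OpenCL/HSA/thunk
# shared objects live.
export LD_LIBRARY_PATH=$HOME/rocm-1.9/opencl/lib/x86_64:$HOME/rocm-1.9/hsa/lib:$LD_LIBRARY_PATH
ldd bin/mdrun-test | grep -Ei 'opencl|hsa'   # confirm which libraries resolve
bin/mdrun-test
```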

@jlgreathouse
Contributor

I still see it fail on my ROCm 1.9 system with a Vega (gfx900) and Fiji (gfx803) installed. OpenCL driver version 2679.0, so I believe this is a full 1.9 install.

I'm currently unable to build a debug version of the OpenCL runtime to get symbols; otherwise I'd point out where the issue is coming up for me.

@b-sumner
Collaborator

Do you know if it is trying to build programs for both devices? Maybe the build is faulting when trying to build for Vega?

@jlgreathouse
Contributor

Just putting my commands down here so I don't need to arrow-up every time I want to run this test. :)

mkdir -p ~/gromacs_test/
cd ~/gromacs_test/
wget https://gerrit.gromacs.org/changes/7810/revisions/f199e29cc958c00bd1481e710d9abdd0d36ae0f9/archive?format=tbz2
mv archive\?format\=tbz2 gromacs.tar.bz2
tar -xf gromacs.tar.bz2
SOURCE_DIR=$(pwd)
mkdir build
cd build
cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DGMX_BUILD_OWN_FFTW=ON
make -j `nproc` mdrun-test
cd bin
./mdrun-test

Re-tested on ROCm 1.9 on a system with only Polaris 10 (gfx803):

$ rocm_agent_enumerator
gfx000
gfx803
$ dkms status
amdgpu, 1.9-211, 4.15.0-34-generic, x86_64: installed
$ clinfo | grep Driver
  Driver version:                                2679.0 (HSA1.1,LC)
./mdrun-test
...
1 GPU auto-selected for this run.
Mapping of GPU IDs to the 1 GPU task in the 1 rank on this node:
  PP:0

NOTE: Thread affinity was not set.
Segmentation fault (core dumped)

gdb backtrace (no symbols in libamdocl64.so at the moment)

Thread 1 "mdrun-test" received signal SIGSEGV, Segmentation fault.
0x00007fffeead21ed in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
(gdb) bt
#0  0x00007fffeead21ed in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#1  0x00007fffeead4b1a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#2  0x00007fffeebbb1cd in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#3  0x00007fffeead3f35 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#4  0x00007fffeead61f8 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#5  0x00007fffeead8a1b in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#6  0x00007fffeead9096 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#7  0x00007fffeec7b227 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#8  0x00007fffef5a506a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9  0x00007fffef2a29df in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007fffef5a5a7e in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007fffed4fa465 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#12 0x00007fffed4fc94d in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#13 0x00007fffed4f11ac in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#14 0x00007fffed7f7f3e in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#15 0x00007fffed8051dd in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#16 0x00007fffed4e07da in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#17 0x00007fffed4e9522 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#18 0x00007fffed4e9938 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#19 0x00007fffed3b5b0d in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#20 0x00007fffed3e0051 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#21 0x00007fffed3b3ccf in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#22 0x00007fffed3a5b3e in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#23 0x00007fffed37bcb9 in clBuildProgram () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#24 0x00007ffff6fb5714 in gmx::ocl::compileProgram(_IO_FILE*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, _cl_context*, _cl_device_id*, ocl_vendor_id_t) () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#25 0x00007ffff6f5f2bb in nbnxn_gpu_compile_kernels(gmx_nbnxn_ocl_t*) () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#26 0x00007ffff6f5cb8a in nbnxn_gpu_init(gmx_nbnxn_ocl_t**, gmx_device_info_t const*, interaction_const_t const*, NbnxnListParameters const*, nbnxn_atomdata_t const*, int, bool) ()
   from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#27 0x00007ffff6e9695c in init_forcerec(_IO_FILE*, gmx::MDLogger const&, t_forcerec*, t_fcdata*, t_inputrec const*, gmx_mtop_t const*, t_commrec const*, float (*) [3], char const*, char const*, gmx::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, gmx_hw_info_t const&, gmx_device_info_t const*, bool, float) () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#28 0x00007ffff6feade7 in gmx::Mdrunner::mdrunner() () from /home/jgreatho/gromacs/build/bin/../lib/libgromacs.so.4
#29 0x00005555555cb9e8 in gmx::Mdrunner::mainFunction(int, char**) ()
#30 0x00005555555cc8ef in gmx_mdrun(int, char**) ()
#31 0x00005555555be6ff in gmx::test::SimulationRunner::callMdrun(gmx::test::CommandLine const&) ()
#32 0x0000555555586a03 in gmx::test::ImdTest_ImdCanRun_Test::TestBody() ()
#33 0x000055555564bdba in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#34 0x000055555563e761 in testing::Test::Run() [clone .part.533] ()
#35 0x000055555563f1f5 in testing::TestInfo::Run() [clone .part.534] ()
#36 0x000055555563f535 in testing::TestCase::Run() [clone .part.535] ()
#37 0x0000555555641d35 in testing::internal::UnitTestImpl::RunAllTests() [clone .part.549] ()
#38 0x0000555555642192 in testing::UnitTest::Run() ()
#39 0x0000555555573089 in main ()

@b-sumner
Collaborator

Sigh. I really do not like rpath. My build of libgromacs.so.4.0.0 has an rpath pointing to the directory containing my internal opencl bits. I see the segv now after hitting it with "chrpath -d".

@jlgreathouse
Contributor

So the good news and bad news for @pszi1ard:

  • Bad news: This bug isn't fixed in the public OpenCL release as of ROCm 1.9.
  • Good news: as our confusion may have demonstrated, we definitely have this bug fixed internally. So all hope is not lost. :)

I don't think we will be able to give you a solid timeline for when this will make it into an external release, as we are still working out which patches may enter any bugfix point release in the 1.9.x series.

@pszi1ard
Author

Thanks for the feedback @b-sumner and @jlgreathouse.

I appreciate that the planning for 1.9.x is ongoing. Is there a (reasonably straightforward!) way for me to build and replace the rocm-opencl packages so I can keep using ROCm?

Do you happen to have a suggestion for a non-invasive transformation on the code to tickle the compiler and avoid the segv?

Otherwise, the side effect of putting aside any testing/development with ROCm right now is that no GROMACS development and testing for the upcoming 2019 release (change freeze in about a month) can be done on ROCm until fixes land. This poses the risk that our next release will not work on ROCm at all; at best, if a fix comes before the final release later this year, we will ship untested/un-tuned code, and at worst we'll have to keep warning users against using ROCm.

Side-note:
@b-sumner Indeed, gmx (and libgromacs.so) are built as relocatable binaries with

  RPATH                $ORIGIN/../lib

but not much more. You can avoid this by passing -DCMAKE_SKIP_RPATH.
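Both options that came up in this thread, as a sketch (the library path below is illustrative):

```shell
# Option 1: configure GROMACS without embedding any RPATH in the binaries
cmake $SOURCE_DIR -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_SKIP_RPATH=ON

# Option 2: strip the RPATH from an already-built library
# (chrpath ships in the 'chrpath' package on most distros)
chrpath -d lib/libgromacs.so.4.0.0

# Either way, inspect what is currently embedded:
readelf -d lib/libgromacs.so.4.0.0 | grep -iE 'rpath|runpath'
```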

@b-sumner
Collaborator

The 1.9 compiler was branched on June 21. The failure being hit is in a "register coalescer" which was subsequently updated after the branch. I'm not really sure how to perturb the code to affect something that deep. One possibility might be to reduce each component separately instead of all 3 at once.

@pszi1ard
Author

One possibility might be to reduce each component separately instead of all 3 at once.

Thanks for the tip. Unfortunately it did not work. The strange thing is that if I completely remove the 2nd, conditional atomic reduction (also a concurrent reduction of three values), the crash is gone. However, if I issue both the in-loop and outside-of-loop atomic ops sequentially, I still get the crash. Any other tips? :)

On a different note: Is it a reasonably straightforward thing (and workable idea) to build a rocm-opencl deb package from source and replace the current one? Where would I start -- is the bug fixed in the 1.9 release branch?

@jlgreathouse
Contributor

Hi @pszi1ard

IMO, it's pretty easy to build a custom install of the OpenCL runtime. See my post in this other issue, which includes a shell script that will do basically everything for you. You might have to play around with it to build a .deb package -- to properly cpack the results, for example. You might also want to change the build type from RelWithDebInfo to Release.

You could pull the pre- and post-install files out of the existing ROCm OpenCL .deb file if you want to make things slightly easier.
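For the .deb step specifically, the general shape would be something like the following. This is only a sketch: it assumes the runtime's CMake project provides CPack metadata, and the directory and package names are placeholders.

```shell
# Build the runtime as Release and let CPack generate a Debian package.
cd opencl-runtime/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"
cpack -G DEB            # only works if the project defines CPack config
sudo dpkg -i ./*.deb    # replace the installed rocm-opencl package
```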

@jlgreathouse
Contributor

I'll have to defer to @b-sumner about whether this bug is fixed in the 1.9.0 source code release branch of the open source OpenCL runtime. I believe our original intent was that the source code release would match the source we used to build the .deb releases.

That said, I see that the roc-1.9.x branch of LLVM that the OpenCL build directions pull from had some patches regarding register allocation brought into it last week. I don't know whether they were meant to fix the issue being raised here.

@pszi1ard
Author

pszi1ard commented Nov 6, 2018

It seems like this missed the 1.9.1 too, right? Any plans to release a new rocm-opencl?

@jlgreathouse
Contributor

There likely won't be a major update to rocm-opencl until ROCm 2.0.

@pszi1ard
Author

pszi1ard commented Nov 6, 2018

That's very unfortunate. Is there an approximate ETA for ROCm 2.0? (I know it has just been announced, but announcement != release, and we need to advise our users whether or not to steer clear of ROCm.)

@jlgreathouse
Contributor

I apologize for the long delay on this. I believe the bugfix is part of a larger series of changes in the compiler. We didn't want to bring new functionality into a 1.9.x point release, but it also would have been difficult to cherry pick this individual fix back into the 1.9.x code base.

I believe that our target is for 2.0 to be out by the end of the year, but I'm not sure if AMD has made a public announcement of an official exact date.

@pszi1ard
Author

pszi1ard commented Nov 6, 2018

@jlgreathouse OK, I understand that the cost of backporting a fix is too high to address this issue.

End of the year would be great, I hope I can get a confirmation soon.

@pszi1ard
Author

We are getting close to our final release and at the moment 1.9 is still not working. I've seen some links to 2.0 beta rpms floating around on the Internet, but I'm not aware of a deb repo. Has your internal testing covered my bug report? Do you have 2.0 beta debs available?

With the latest 1.9.x, in addition to the previous 100% reproducible segv, I also see a clBuildProgram segv in ~0.1-0.2% of compilations, mostly (only?) when building clFFT. Do you know of any such issues?

@jlgreathouse
Contributor

Hi @pszi1ard

Do you have a preferred contact mechanism? I can send you a link to our 2.0 beta repos, but I would prefer not to post it publicly.

@pszi1ard
Author

Email. [email protected]
Debs would be much appreciated. Also a rough timeline if possible (even if under NDA, which I am covered by) so we know what deadlines not to miss.

@pszi1ard
Author

PS: Thanks. I'll also try to coordinate with the internal team that does GROMACS validation (although they seem to have focused on the previous release, which works fine).
