HIP Kernel Compiler Issue #18

Open
nikhil-tensorwave opened this issue Jun 6, 2024 · 2 comments
@nikhil-tensorwave

After building OpenMM-HIP, running make test produces HIP compiler errors.
These errors are of the form

Error creating kernel <kernel function name>: hipErrorNotFound (500)

I'm also getting

Error launching HIP compiler: 256

Runtime environment:
ROCm 6.1.1
Ubuntu 22.04
Python 3.10
PyTorch 2.4.0

These were the setup steps used:

## build openmm
git clone https://github.com/openmm/openmm.git
cd openmm
git checkout 8.1.1
mkdir -p build/install
cd build
cmake ../ -D CMAKE_INSTALL_PREFIX=./install -D PYTHON_EXECUTABLE=/usr/bin/python3 -D OPENMM_BUILD_COMMON=ON -D OPENMM_PYTHON_USER_INSTALL=OFF -D CMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0"
make -j128
make test
make install
cd ../..

## build openmm-hip
git clone https://github.com/amd/openmm-hip.git
cd openmm-hip
git checkout mi300_changes  # necessary for ROCm 6.0!
mkdir build && cd build
cmake ../ -D OPENMM_DIR=../../openmm/build/install -D OPENMM_SOURCE_DIR=../../openmm -D CMAKE_INSTALL_PREFIX=../../openmm/build/install -D CMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0"
make -j128
make test  # these mostly fail with above errors
ctest -j 128 --rerun-failed  # if you keep rerunning them, more and more pass

Each time I rerun the failed tests, a small percentage more of them pass.

Any help on this would be appreciated.

@ex-rzr
Contributor

ex-rzr commented Jun 7, 2024

I've never seen such errors.

Can you check with this branch #14 (https://github.com/StreamHPC/openmm-hip/tree/develop_stream)?

Also, can you run ctest without -j (ctest --output-on-failure)? Perhaps something is wrong with concurrent compilation/running.

Btw, what GPUs do you use?
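
The serial rerun suggested above could be wrapped as in the following sketch. It assumes it is pointed at the openmm-hip build directory; the function name run_tests_serially and the CTEST override variable are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: run the test suite one test at a time.
# $1 is the build directory (defaults to the current directory).
run_tests_serially() {
    build_dir="${1:-.}"
    # Without -j, ctest runs tests sequentially, which rules out
    # interference between concurrent kernel compilations.
    # --output-on-failure prints the full output of each failing test.
    # The subshell keeps the caller's working directory unchanged.
    (cd "$build_dir" && "${CTEST:-ctest}" --output-on-failure)
}
```

Calling run_tests_serially . from the build directory runs every test in sequence. If the failures disappear in this mode, concurrency is the likely cause; if they persist, the full output of each failing test is available for inspection.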

@nikhil-tensorwave
Author

Switching to that branch and adding gfx942 to the list of GPU architectures fixed the issue! Thank you very much for the help. Also, we're running on MI300Xs.
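
Since adding the right architecture name resolved the errors here, it may help to check which name the installed GPUs report before building. The sketch below filters text such as the output of rocminfo (a tool that ships with ROCm) for gfx names. The function name list_gfx_targets is made up for illustration. How the architecture list is set in the openmm-hip build is not shown in this thread, so that step is not reproduced here.

```shell
# Hypothetical helper: read text on stdin and print each distinct
# gfx architecture name (for example gfx942) exactly once.
list_gfx_targets() {
    # grep -o prints only the matching part of each line;
    # sort -u orders the names and drops duplicates.
    grep -o 'gfx[0-9a-f]*' | sort -u
}
```

A typical use would be rocminfo | list_gfx_targets. Any name it prints that is missing from the build's list of GPU architectures is a candidate for the kind of fix described above.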
