You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I encountered a simple case where (I believe) the compiler applies an illegal loop fission, leading to a deadlock on execution. A very simple reproducer has thread 1 waiting for a value written by thread 0.
Source code:
#include<iostream>
#include<hip/hip_runtime.h>
#include<thrust/device_vector.h>
__global__ voidkernel(int* values)
{
if (threadIdx.x >= 2) return;
int i = threadIdx.x;
int value = -1;
while (value == -1) {
value = __hip_atomic_load(values + i, __ATOMIC_ACQUIRE, __HIP_MEMORY_SCOPE_AGENT);
if (value != -1) {
__hip_atomic_store(values + i + 1, value + 1, __ATOMIC_RELEASE, __HIP_MEMORY_SCOPE_AGENT);
}
}
}
intmain()
{
thrust::device_vector<int> v(3, -1);
v[0] = 0;
kernel<<<1, 2>>>(v.data().get());
hipDeviceSynchronize();
std::cout << v[0] << v[1] << v[2] << '\n';
}
I believe the use of acquire/release atomics is not necessary here, since this is not a question of visibility of changes, but a question of forward progress, I just added them to be sure.
The optimizations correctly recognize that when the inner if condition is reached, the loop can be exited, but it does separate the while(value == -1) value = load(...) section into a separate loop, which prevents forward-progress for the rest of the loop.
The loop consists entirely of the load of value, with the store to values happening afterwards, while the code as written executes it as part of the loop.
For comparison, executing the equivalent code on an NVIDIA GPU without independent thread scheduling (P100) does not lead to this deadlock behavior, and correct output 012
Problem Description
I encountered a simple case where (I believe) the compiler applies an illegal loop fission, leading to a deadlock on execution. A very simple reproducer has thread 1 waiting for a value written by thread 0.
Source code:
I believe the use of acquire/release atomics is not necessary here, since this is not a question of visibility of changes, but a question of forward progress, I just added them to be sure.
The compiler translates this to
The optimizations correctly recognize that when the inner
if
condition is reached, the loop can be exited, but it does separate thewhile(value == -1) value = load(...)
section into a separate loop, which prevents forward-progress for the rest of the loop.The loop consists entirely of the load of
value
, with the store tovalues
happening afterwards, while the code as written executes it as part of the loop.For comparison, executing the equivalent code on an NVIDIA GPU without independent thread scheduling (P100) does not lead to this deadlock behavior, and correct output
012
Operating System
Rocky Linux 9.5 (Blue Onyx)
CPU
AMD EPYC 7713 64-Core Processor
GPU
AMD MI50 (gfx906, amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-)
ROCm Version
ROCm 6.3.2
ROCm Component
llvm-project
Steps to Reproduce
Compile the code with
hipcc test.hip.cpp --save-temps
, execute the output./a.out
and inspect the generated assembly.(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response
The text was updated successfully, but these errors were encountered: