I am trying to use cuStreamWriteValue32, which is part of the CUDA driver API (context: #3894). The build succeeds, but at runtime the call fails with CUDA_ERROR_NOT_SUPPORTED. This should be supported, since I am using a DGX H100 node with CUDA 12.8, inside the latest pjnl Docker container.
Repro:
Container image: gitlab-master.nvidia.com:5005/dl/pytorch/update-scripts:pjnl-latest
Driver Version: 550.90.07, CUDA Version: 12.8 (I also tried with more recent drivers)
The source of the problem can be narrowed down to lazily loading /usr/local/cuda/compat/lib.real/libcuda.so.1 in the pjnl container: the bug comes either from the lazy loading or from the library itself.
As evidence, note that the following patch (which links against CUDA explicitly, non-lazily) fixes the bug:
Note also that cuda-gdb gives the following backtrace at the point of the error:
#0 0x00007fff37f740f0 in cudbgReportDriverApiError () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1 0x00007fff381e312b in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2 0x00007fff2f4c0d47 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#3 0x00007fff2f49c29e in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#4 0x00007fff2f4af56d in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#5 0x00007fff2f5aebd6 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#6 0x00007fff380c05d0 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7 0x0000555555a67b3e in lazilyLoadAndInvoke (args#0=0x7fff2ad0d618, args#1=140724802682880, args#2=3, args#3=0) at /opt/pytorch/Fuser2/csrc/driver_api.cpp:95
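For readers unfamiliar with the pattern, here is a minimal sketch of what lazily loading a driver entry point generally looks like: the library is opened with `dlopen` at runtime and the function is looked up by name with `dlsym`, instead of being resolved by the linker. This is not the code in `driver_api.cpp`; `loadSymbol` and `loadFunction` are hypothetical helper names introduced only for illustration. One thing this pattern makes visible is that the caller picks the exact symbol string. `cuda.h` remaps several API names to versioned symbols through macros (for example `cuMemAlloc` to `cuMemAlloc_v2`), so a name looked up by hand may not be the symbol a linked build would call. Whether that applies to `cuStreamWriteValue32` has not been verified in this report.

```cpp
#include <dlfcn.h>

#include <cassert>
#include <cstdio>

// Hypothetical helper (not taken from driver_api.cpp): opens `library` at
// runtime and returns the address of `symbol`, or nullptr if either step fails.
void* loadSymbol(const char* library, const char* symbol) {
  assert(library != nullptr && symbol != nullptr);

  // The library is located and mapped here, at call time, not at link time.
  void* handle = dlopen(library, RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    const char* err = dlerror();
    std::fprintf(stderr, "dlopen(%s) failed: %s\n", library,
                 err != nullptr ? err : "unknown error");
    return nullptr;
  }

  dlerror();  // Clear any stale error state before the lookup.
  void* address = dlsym(handle, symbol);
  if (address == nullptr) {
    const char* err = dlerror();
    std::fprintf(stderr, "dlsym(%s) failed: %s\n", symbol,
                 err != nullptr ? err : "unknown error");
  }

  // The handle is intentionally never closed, so the returned pointer stays
  // valid for the lifetime of the process.
  return address;
}

// Hypothetical typed wrapper: the caller supplies the function pointer type,
// because dlsym carries no signature information.
template <typename Fn>
Fn loadFunction(const char* library, const char* symbol) {
  return reinterpret_cast<Fn>(loadSymbol(library, symbol));
}
```

For the driver API, the call would be something like `loadFunction<CUresult (*)(CUstream, CUdeviceptr, cuuint32_t, unsigned int)>("libcuda.so.1", "cuStreamWriteValue32")`, with the types coming from `cuda.h`. The library name and symbol string in that call are assumptions for illustration. Printing the path that `dlopen` actually resolved is one way to confirm whether the compat library from the backtrace is the one being picked up.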
related Team's thread