Potential deadlock during training #2805

nogilnick · 2025-01-16T02:57:25Z

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.17.0

Custom code

No

OS platform and distribution

Ubuntu 22.04

Mobile device

No response

Python version

3.12.7

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Using a Radeon RX 6400 with ROCm 6.3.1 on Ubuntu 22.04. Installed tensorflow-rocm via:

pip install tensorflow-rocm==2.17.0 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/ --upgrade

Training several different simple models in Keras.

Training mostly works well, but occasionally hits what appears to be a deadlock: training progress stops and the GPU appears to go idle. This happens intermittently; sometimes I can run for hours without hitting it.
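Because the hang is intermittent, one way to capture it without waiting at the terminal might be a watchdog built on Python's standard `faulthandler` module. This is only a sketch: `arm_watchdog` and `disarm_watchdog` are hypothetical helper names, and it assumes a real stall lasts longer than the chosen timeout. It dumps Python-level stacks only, so it complements GDB rather than replacing it.

```python
import faulthandler
import sys


def arm_watchdog(timeout_s, file=sys.stderr):
    """(Re)arm a timer that dumps every thread's Python stack when it expires.

    Hypothetical helper: calling it again replaces the previously armed timer,
    so calling it once per training step acts as a heartbeat.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)


def disarm_watchdog():
    """Cancel the pending dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

A training loop could call `arm_watchdog(...)` at the end of each batch (for example from a Keras callback's `on_train_batch_end`) and `disarm_watchdog()` after `fit` returns. If no batch completes within the timeout, the stacks are written to the given file.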

Attached with GDB; most threads appear to be waiting in __futex_abstimed_wait_common64.

Standalone code to reproduce the issue

Basic TF Keras classifier.

Relevant log output

(gdb) info thread
  Id   Target Id                                 Frame 
* 1    Thread 0x77f78aa024c0 (LWP 8154) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x77f6d8e00640 (LWP 8155) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcade0 <thread_status+96>)
    at ./nptl/futex-internal.c:57
  3    Thread 0x77f6d8400640 (LWP 8156) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcae60 <thread_status+224>)
    at ./nptl/futex-internal.c:57
  4    Thread 0x77f6d7a00640 (LWP 8157) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcaee0 <thread_status+352>)
    at ./nptl/futex-internal.c:57
  5    Thread 0x77f6d3000640 (LWP 8158) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcaf60 <thread_status+480>)
    at ./nptl/futex-internal.c:57
  6    Thread 0x77f6d0600640 (LWP 8159) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcafe0 <thread_status+608>)
    at ./nptl/futex-internal.c:57
  7    Thread 0x77f6cdc00640 (LWP 8160) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcb060 <thread_status+736>)
    at ./nptl/futex-internal.c:57
  8    Thread 0x77f6cb200640 (LWP 8161) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f6dbdcb0e0 <thread_status+864>)
    at ./nptl/futex-internal.c:57
  9    Thread 0x77f6b8400640 (LWP 8165) "python" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
  10   Thread 0x77f5afe00640 (LWP 8167) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae364d8) at ./nptl/futex-internal.c:57
  11   Thread 0x77f5af400640 (LWP 8168) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae36558) at ./nptl/futex-internal.c:57
  12   Thread 0x77f5aea00640 (LWP 8169) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae365d8) at ./nptl/futex-internal.c:57
  13   Thread 0x77f5ae000640 (LWP 8170) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae36658) at ./nptl/futex-internal.c:57
  14   Thread 0x77f5ad600640 (LWP 8171) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae366d8) at ./nptl/futex-internal.c:57
  15   Thread 0x77f5acc00640 (LWP 8172) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae36758) at ./nptl/futex-internal.c:57
  16   Thread 0x77f571200640 (LWP 8173) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae367d8) at ./nptl/futex-internal.c:57
  17   Thread 0x77f56be00640 (LWP 8174) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ae36858) at ./nptl/futex-internal.c:57
  18   Thread 0x77f56b400640 (LWP 8175) "python" __GI___ioctl (fd=3, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
  19   Thread 0x77f55fe00640 (LWP 8176) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b9516d8) at ./nptl/futex-internal.c:57
  20   Thread 0x77f55f400640 (LWP 8177) "python" 0x000077f78a6e57f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x77f55f3ffce0, rem=0x77f55f3ffce0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  21   Thread 0x77f55ea00640 (LWP 8178) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93bd8) at ./nptl/futex-internal.c:57
  22   Thread 0x77f55e000640 (LWP 8179) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93c58) at ./nptl/futex-internal.c:57
  23   Thread 0x77f55d600640 (LWP 8180) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93cd8) at ./nptl/futex-internal.c:57
  24   Thread 0x77f55cc00640 (LWP 8181) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93d5c) at ./nptl/futex-internal.c:57
  25   Thread 0x77f553e00640 (LWP 8182) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93dd8) at ./nptl/futex-internal.c:57
  26   Thread 0x77f553400640 (LWP 8183) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93e58) at ./nptl/futex-internal.c:57
  27   Thread 0x77f552a00640 (LWP 8184) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93edc) at ./nptl/futex-internal.c:57
  28   Thread 0x77f552000640 (LWP 8185) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4ba93f5c) at ./nptl/futex-internal.c:57
  29   Thread 0x77f551600640 (LWP 8186) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4bb486d8) at ./nptl/futex-internal.c:57
  30   Thread 0x77f547800640 (LWP 8397) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd6dc) at ./nptl/futex-internal.c:57
  31   Thread 0x77f546e00640 (LWP 8398) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd75c) at ./nptl/futex-internal.c:57
  32   Thread 0x77f546400640 (LWP 8399) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd7d8) at ./nptl/futex-internal.c:57
  33   Thread 0x77f545a00640 (LWP 8400) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd85c) at ./nptl/futex-internal.c:57
  34   Thread 0x77f545000640 (LWP 8401) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd8dc) at ./nptl/futex-internal.c:57
  35   Thread 0x77f537e00640 (LWP 8402) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd95c) at ./nptl/futex-internal.c:57
  36   Thread 0x77f537400640 (LWP 8403) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dd9d8) at ./nptl/futex-internal.c:57
  37   Thread 0x77f536a00640 (LWP 8404) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x4b5dda58) at ./nptl/futex-internal.c:57
  38   Thread 0x77f536000640 (LWP 8405) "python" 0x000077f78a6e57f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x77f535fffaf0, rem=0x77f535fffaf0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  39   Thread 0x77f52f400640 (LWP 8410) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  40   Thread 0x77f52fe00640 (LWP 8411) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  41   Thread 0x77f534c00640 (LWP 8412) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  42   Thread 0x77f535600640 (LWP 8413) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  43   Thread 0x77f52ea00640 (LWP 8414) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  44   Thread 0x77f52e000640 (LWP 8416) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f504005dd8) at ./nptl/futex-internal.c:57
  45   Thread 0x77f52d600640 (LWP 8417) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f504005e58) at ./nptl/futex-internal.c:57
  46   Thread 0x77f52cc00640 (LWP 8418) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f504005ed8) at ./nptl/futex-internal.c:57
  47   Thread 0x77f523e00640 (LWP 8419) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f504005f58) at ./nptl/futex-internal.c:57
  48   Thread 0x77f523400640 (LWP 8420) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f504005fd8) at ./nptl/futex-internal.c:57
  49   Thread 0x77f522a00640 (LWP 8421) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x77f504006058) at ./nptl/futex-internal.c:57


Full stack trace of thread 8154:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000077f78835f2e8 in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#2  0x000077f78835f387 in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#3  0x000077f78835f59b in AbslInternalPerThreadSemWait_lts_20230802 () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#4  0x000077f788360a2d in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#5  0x000077f788360e01 in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#6  0x000077f788360bb1 in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#7  0x000077f788363780 in absl::lts_20230802::Notification::WaitForNotification() const ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#8  0x000077f787149e98 in tensorflow::ProcessFunctionLibraryRuntime::RunSync(tensorflow::FunctionLibraryRuntime::Options const&, unsigned long, absl::lts_20230802::Span<tensorflow::Tensor const>, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*) const () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#9  0x000077f77246b5a8 in tensorflow::KernelAndDeviceFunc::Run(tensorflow::ScopedStepContainer*, tensorflow::EagerKernelArgs const&, std::vector<std::variant<tensorflow::Tensor, tensorflow::TensorShape>, std::allocator<std::variant<tensorflow::Tensor, tensorflow::TensorShape> > >*, tsl::CancellationManager*, std::optional<tensorflow::EagerFunctionParams> const&, std::optional<tensorflow::ManagedStackTrace> const&, tsl::CoordinationServiceAgent*) () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#10 0x000077f77241777e in tensorflow::EagerKernelExecute(tensorflow::EagerContext*, absl::lts_20230802::InlinedVector<tensorflow::TensorHandle*, 4ul, std::allocator<tensorflow::TensorHandle*> > const&, std::optional<tensorflow::EagerFunctionParams> const&, tsl::core::RefCountPtr<tensorflow::KernelAndDevice> const&, tensorflow::GraphCollector*, tsl::CancellationManager*, absl::lts_20230802::Span<tensorflow::TensorHandle*>, std::optional<tensorflow::ManagedStackTrace> const&) () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#11 0x000077f772420e8e in tensorflow::ExecuteNode::Run() () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#12 0x000077f772466be4 in tensorflow::EagerExecutor::SyncExecute(tensorflow::EagerNode*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#13 0x000077f772417229 in tensorflow::(anonymous namespace)::EagerLocalExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#14 0x000077f772414c79 in tensorflow::DoEagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#15 0x000077f772418448 in tensorflow::EagerExecute(tensorflow::EagerOperation*, tensorflow::TensorHandle**, int*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#16 0x000077f7717bae07 in tensorflow::EagerOperation::Execute(absl::lts_20230802::Span<tensorflow::AbstractTensorHandle*>, int*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#17 0x000077f7724653b3 in tensorflow::CustomDeviceOpHandler::Execute(tensorflow::ImmediateExecutionOperation*, tensorflow::ImmediateExecutionTensorHandle**, int*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#18 0x000077f76fe76a65 in TFE_Execute () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#19 0x000077f785b1f028 in TFE_Py_ExecuteCancelable(TFE_Context*, char const*, char const*, absl::lts_20230802::InlinedVector<TFE_TensorHandle*, 4ul, std::allocator<TFE_TensorHandle*> >*, _object*, TFE_CancellationManager*, absl::lts_20230802::InlinedVector<TFE_TensorHandle*, 2ul, std::allocator<TFE_TensorHandle*> >*, TSL_Status*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../_pywrap_tensorflow_internal.so
#20 0x000077f6d93561f6 in tensorflow::TFE_Py_ExecuteCancelable_wrapper(pybind11::handle const&, char const*, char const*, pybind11::handle const&, pybind11::handle const&, tsl::CancellationManager*, pybind11::handle const&) () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/_pywrap_tfe.so
#21 0x000077f6d9392f74 in pybind11::cpp_function::initialize<pybind11_init__pywrap_tfe(pybind11::module_&)::$_58, pybind11::object, pybind11::handle const&, char const*, char const*, pybind11::handle const&, pybind11::handle const&, pybind11::handle const&, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init__pywrap_tfe(pybind11::module_&)::$_58&&, pybind11::object (*)(pybind11::handle const&, char const*, char const*, pybind11::handle const&, pybind11::handle const&, pybind11::handle const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/_pywrap_tfe.so
#22 0x000077f6d936e418 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/_pywrap_tfe.so
#23 0x0000000000549d54 in cfunction_call (func=0x77f6d943ede0, args=0x77f5b53c7ee0, kwargs=0x0) at /usr/local/src/conda/python-3.12.7/Objects/methodobject.c:537
#24 0x000000000051af9b in _PyObject_MakeTpCall (tstate=0x9bfb70 <_PyRuntime+458992>, callable=0x77f6d943ede0, args=<optimized out>, nargs=<optimized out>, keywords=0x0)
    at /usr/local/src/conda/python-3.12.7/Objects/call.c:240
#25 0x0000000000525903 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x77f78a8928f0, throwflag=<optimized out>) at Python/bytecodes.c:2715
#26 0x0000000000575717 in _PyEval_EvalFrame (throwflag=0, frame=0x77f78a892750, tstate=0x9bfb70 <_PyRuntime+458992>) at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_ceval.h:89
#27 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, func=0x77f6c5642f20, tstate=<optimized out>) at /usr/local/src/conda/python-3.12.7/Python/ceval.c:1683
#28 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=<optimized out>, func=0x77f6c5642f20) at /usr/local/src/conda/python-3.12.7/Objects/call.c:419
#29 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x77f6c5642f20, tstate=0x9bfb70 <_PyRuntime+458992>)
    at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_call.h:92
#30 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.7/Objects/classobject.c:91
#31 0x000000000052acad in PyCFunction_Call (kwargs=0x0, args=0x77f5b53a13f0, callable=0x77f3a4493fc0) at /usr/local/src/conda/python-3.12.7/Objects/call.c:387
#32 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x77f78a8926d8, throwflag=<optimized out>) at Python/bytecodes.c:3263
#33 0x0000000000575717 in _PyEval_EvalFrame (throwflag=0, frame=0x77f78a892448, tstate=0x9bfb70 <_PyRuntime+458992>) at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_ceval.h:89
#34 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, func=0x77f6c48dee80, tstate=<optimized out>) at /usr/local/src/conda/python-3.12.7/Python/ceval.c:1683
#35 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=<optimized out>, func=0x77f6c48dee80) at /usr/local/src/conda/python-3.12.7/Objects/call.c:419
#36 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x77f6c48dee80, tstate=0x9bfb70 <_PyRuntime+458992>)
    at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_call.h:92
#37 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.12.7/Objects/classobject.c:91
#38 0x000000000052acad in PyCFunction_Call (kwargs=0x77f3a4490a00, args=0x77f6b7b82e30, callable=0x77f346c41140) at /usr/local/src/conda/python-3.12.7/Objects/call.c:387
#39 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x77f78a892370, throwflag=<optimized out>) at Python/bytecodes.c:3263
#40 0x000000000051db07 in _PyEval_EvalFrame (throwflag=0, frame=0x77f78a8922d0, tstate=0x9bfb70 <_PyRuntime+458992>) at /usr/local/src/conda/python-3.12.7/Include/internal/pycore_ceval.h:89
#41 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7ffdb8f07090, locals=0x0, func=0x77f6c48dede0, tstate=0x9bfb70 <_PyRuntime+458992>) at /usr/local/src/conda/python-3.12.7/Python/ceval.c:1683
#42 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7ffdb8f07090, func=0x77f6c48dede0) at /usr/local/src/conda/python-3.12.7/Objects/call.c:419
#43 _PyObject_FastCallDictTstate (tstate=<optimized out>, callable=0x77f6c48dede0, args=0x7ffdb8f07090, nargsf=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.7/Objects/call.c:133
#44 0x0000000000557e56 in _PyObject_Call_Prepend (tstate=0x9bfb70 <_PyRuntime+458992>, callable=0x77f6c48dede0, obj=0x77f6b7b81430, args=<optimized out>, kwargs=0x0)
    at /usr/local/src/conda/python-3.12.7/Objects/call.c:508
#45 0x000000000062f066 in slot_tp_call (self=0x77f6b7b81430, args=0x77f6b7b21c00, kwds=0x0) at /usr/local/src/conda/python-3.12.7/Objects/typeobject.c:8782
#46 0x000000000051af9b in _PyObject_MakeTpCall (tstate=0x9bfb70 <_PyRuntime+458992>, callable=0x77f6b7b81430, args=<optimized out>, nargs=<optimized out>, keywords=0x0)
    at /usr/local/src/conda/python-3.12.7/Objects/call.c:240
#47 0x0000000000525903 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x77f78a892140, throwflag=<optimized out>) at Python/bytecodes.c:2715
#48 0x00000000005e3c6e in PyEval_EvalCode (co=<optimized out>, globals=0x77f78a9f1480, locals=<optimized out>) at /usr/local/src/conda/python-3.12.7/Python/ceval.c:578
#49 0x000000000060a0b7 in run_eval_code_obj (tstate=0x9bfb70 <_PyRuntime+458992>, co=0x3f22e9a0, globals=0x77f78a9f1480, locals=0x77f78a9f1480) at /usr/local/src/conda/python-3.12.7/Python/pythonrun.c:1722
#50 0x00000000006056d7 in run_mod (mod=<optimized out>, filename=0x77f78a927c90, globals=0x77f78a9f1480, locals=0x77f78a9f1480, flags=0x7ffdb8f075b0, arena=0x77f78a913c90)
    at /usr/local/src/conda/python-3.12.7/Python/pythonrun.c:1743
#51 0x000000000061d602 in pyrun_file (fp=fp@entry=0x3f1c39c0, filename=filename@entry=0x77f78a927c90, start=start@entry=257, globals=globals@entry=0x77f78a9f1480, locals=locals@entry=0x77f78a9f1480, 
    closeit=closeit@entry=1, flags=0x7ffdb8f075b0) at /usr/local/src/conda/python-3.12.7/Python/pythonrun.c:1643
#52 0x000000000061cf40 in _PyRun_SimpleFileObject (fp=0x3f1c39c0, filename=0x77f78a927c90, closeit=1, flags=0x7ffdb8f075b0) at /usr/local/src/conda/python-3.12.7/Python/pythonrun.c:433
#53 0x000000000061cd33 in _PyRun_AnyFileObject (fp=0x3f1c39c0, filename=0x77f78a927c90, closeit=1, flags=0x7ffdb8f075b0) at /usr/local/src/conda/python-3.12.7/Python/pythonrun.c:78
#54 0x0000000000615dc3 in pymain_run_file_obj (skip_source_first_line=0, filename=0x77f78a927c90, program_name=0x77f78a9ec270) at /usr/local/src/conda/python-3.12.7/Modules/main.c:360
#55 pymain_run_file (config=0x962750 <_PyRuntime+77008>) at /usr/local/src/conda/python-3.12.7/Modules/main.c:379
#56 pymain_run_python (exitcode=0x7ffdb8f07584) at /usr/local/src/conda/python-3.12.7/Modules/main.c:633
#57 Py_RunMain () at /usr/local/src/conda/python-3.12.7/Modules/main.c:713
#58 0x00000000005cc5b9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.12.7/Modules/main.c:767
#59 0x000077f78a629d90 in __libc_start_call_main (main=main@entry=0x5cc4f0 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffdb8f07808) at ../sysdeps/nptl/libc_start_call_main.h:58
#60 0x000077f78a629e40 in __libc_start_main_impl (main=0x5cc4f0 <main>, argc=2, argv=0x7ffdb8f07808, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdb8f077f8)
    at ../csu/libc-start.c:392
#61 0x00000000005cc3e9 in _start ()


nogilnick · 2025-01-16T20:22:28Z

Seeing similar behavior today with most threads stuck on __futex_abstimed_wait_common64. Slightly more interesting trace on the first thread this time:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__GI___ioctl (fd=66, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36	../sysdeps/unix/sysv/linux/ioctl.c: No such file or directory.
(gdb) bt
#0  __GI___ioctl (fd=66, request=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007e4b94f28d00 in ?? () from /opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
#2  0x00007e4b94f216f4 in ?? () from /opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
#3  0x00007e4b94f21ecb in ?? () from /opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
#4  0x00007e4b94e6faf0 in ?? () from /opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
#5  0x00007e4b94e6f7ce in ?? () from /opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
#6  0x00007e4b94e63b79 in ?? () from /opt/rocm-6.3.1/lib/libhsa-runtime64.so.1
#7  0x00007e4b8a9882fc in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#8  0x00007e4b8a988610 in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#9  0x00007e4b8a98a5ee in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#10 0x00007e4b8a9c4be9 in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#11 0x00007e4b8a9c6317 in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#12 0x00007e4b8a9c66eb in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#13 0x00007e4b8a987581 in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#14 0x00007e4b8a94e3f9 in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#15 0x00007e4b8a7c9a2a in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#16 0x00007e4b8a7dd95a in ?? () from /opt/rocm-6.3.1/lib/libamdhip64.so.6
#17 0x00007e4b975fc0fa in stream_executor::gpu::GpuDriver::AsynchronousMemcpyH2D(stream_executor::gpu::GpuContext*, void*, void const*, unsigned long, ihipStream_t*) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#18 0x00007e4b975c8e1d in stream_executor::gpu::GpuExecutor::Memcpy(stream_executor::Stream*, stream_executor::DeviceMemoryBase*, void const*, unsigned long) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#19 0x00007e4b97614562 in stream_executor::StreamCommon::Memcpy(stream_executor::DeviceMemoryBase*, void const*, unsigned long) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#20 0x00007e4b968c0c72 in tensorflow::GPUUtil::CopyCPUTensorToGPU(tensorflow::Tensor const*, tensorflow::DeviceContext const*, tensorflow::Device*, tensorflow::Tensor*, std::function<void (absl::lts_20230802::Status const&)>, bool) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#21 0x00007e4b968c24db in tensorflow::GPUDeviceContext::CopyCPUTensorToDevice(tensorflow::Tensor const*, tensorflow::Device*, tensorflow::Tensor*, std::function<void (absl::lts_20230802::Status const&)>, bool) const ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#22 0x00007e4b96a1fa11 in tensorflow::(anonymous namespace)::CopyHostToDevice(tensorflow::Tensor const*, tsl::Allocator*, tsl::Allocator*, std::basic_string_view<char, std::char_traits<char> >, tensorflow::Device*, tensorflow::Tensor*, tensorflow::DeviceContext*, std::function<void (absl::lts_20230802::Status const&)>, bool) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#23 0x00007e4b96a1eb58 in tensorflow::CopyTensor::ViaDMA(std::basic_string_view<char, std::char_traits<char> >, tensorflow::DeviceContext*, tensorflow::DeviceContext*, tensorflow::Device*, tensorflow::Device*, tsl::AllocatorAttributes, tsl::AllocatorAttributes, tensorflow::Tensor const*, tensorflow::Tensor*, int, std::function<void (absl::lts_20230802::Status const&)>, bool) ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2
#24 0x00007e4b6684ad13 in tensorflow::TensorHandle::CopyToDevice(tensorflow::EagerContext const&, tensorflow::Device*, tensorflow::Tensor*) const ()
   from /home/<redacted>/anaconda3/envs/tf_rocm/lib/python3.12/site-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
#25 0x00007e4b668229b2 in tensorflow::CopyToDeviceNode::Run() ()
...

ppanchad-amd · 2025-01-31T15:16:16Z

Hi @nogilnick. Internal ticket has been created to investigate your issue. Thanks!

ppanchad-amd added the Under Investigation label Jan 31, 2025
