Potential deadlock during training #2805
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
2.17.0
Custom code
No
OS platform and distribution
Ubuntu 22.04
Mobile device
No response
Python version
3.12.7
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
Using a Radeon RX 6400 with rocm-6.3.1 on Ubuntu 22.04. Installed tensorflow-rocm via:
pip install tensorflow-rocm==2.17.0 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-6.3/ --upgrade
Training several different simple models in Keras mostly works well, but occasionally I hit what appears to be a deadlock: training progress stops and the GPU appears to go idle. This is fairly intermittent; sometimes I can run for hours without hitting it.
Attaching with GDB, it looks like all threads are waiting in __futex_abstimed_wait_common64.
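Roughly how that state can be captured (a sketch; the exact invocation and the PID placeholder are mine, not necessarily how it was originally inspected):

```
# Attach GDB to the hung training process and dump every thread's backtrace,
# then detach. <PID> is a placeholder for the training process id.
gdb -p <PID> -batch -ex "thread apply all bt"
```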
Standalone code to reproduce the issue
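The hang is intermittent, so there is no minimal reproducer that triggers it reliably. Below is only a rough sketch of the kind of workload involved; the model, data, and sizes are illustrative placeholders, not the actual code:

```python
# Illustrative sketch only -- not the actual training code from this report.
# A small Keras model trained repeatedly, standing in for the "several
# different simple models" described above.
import numpy as np
import tensorflow as tf

# Placeholder synthetic data; the real dataset and shapes are assumptions.
x = np.random.rand(10_000, 32).astype("float32")
y = np.random.randint(0, 2, size=(10_000, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# The hang shows up intermittently, sometimes only after hours of training,
# so the sketch simply loops fit() many times.
for _ in range(1000):
    model.fit(x, y, epochs=5, batch_size=128, verbose=1)
```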
Relevant log output
Comments
Seeing similar behavior today with most threads stuck on __futex_abstimed_wait_common64.
Hi @nogilnick. An internal ticket has been created to investigate your issue. Thanks!