train hang error #31

Open
justgogorunrun opened this issue Jan 17, 2025 · 1 comment
Comments

@justgogorunrun

Have you ever run into training getting stuck in a hang state? Sometimes it happens after a fixed amount of data has been fed in, and sometimes while the video is being read directly. Yet GPU memory is clearly only about half occupied.
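For anyone hitting the same symptom, one generic way to see where a Python training process is stuck is the standard-library faulthandler module, which can dump every thread's stack trace on a signal. This is a minimal diagnostic sketch, not part of this repo's training script; the timeout value is an arbitrary example:

```python
import faulthandler
import signal

# Dump a stack trace for every thread when the process receives SIGUSR1.
# Add this near the top of the training script (hypothetical placement),
# then run `kill -USR1 <pid>` from another shell once training hangs.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump tracebacks automatically if the process makes no
# progress for 30 minutes (1800 seconds is a placeholder, not a tuned value).
faulthandler.dump_traceback_later(timeout=1800, repeat=True)
```

The stack dump usually shows whether the hang is in the data loader (e.g. a stuck video read) or inside a collective communication call.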

@2U1
Owner

2U1 commented Jan 17, 2025

I haven't encountered that issue when running in my environment. It could be some kind of version mismatch between CUDA, DeepSpeed, and PyTorch.
Does the same thing happen when using ZeRO optimization with stage 2?
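For reference, here is a quick way to print the versions that most often cause CUDA/DeepSpeed/PyTorch mismatches, together with a minimal ZeRO stage 2 configuration to test the suggestion above. The batch sizes are placeholder values, not settings from this repo:

```python
import torch
import deepspeed

# Versions involved in the most common build/runtime mismatches.
# Running DeepSpeed's `ds_report` command from a shell also summarizes
# compatibility between the installed torch and deepspeed builds.
print("torch:", torch.__version__)
print("cuda (torch build):", torch.version.cuda)
print("deepspeed:", deepspeed.__version__)

# Minimal ZeRO stage 2 configuration (placeholder batch sizes) that can
# be written out as ds_config.json and passed to the deepspeed launcher.
zero2_config = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```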
