How to train models with multiple GPUs? #37
Comments
Thank you for your attention to TFB!
It would be appreciated if you could provide the script/command that caused the issue. From your description, I guess you are trying to run multi-GPU jobs with the Ray backend, which is not fully supported at the moment.
So, I'd suggest either using the sequential backend if your model only works with more than one GPU (most of our baselines require further modification to work with multiple GPUs), or using the Ray backend and limiting the model to a single GPU.
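For illustration, a minimal sketch of the second option (Ray backend, one GPU per job) might look like the following. This uses plain Ray rather than TFB's actual launch scripts, and the function and model names are only placeholders:

```python
import ray
import torch

ray.init()  # Ray discovers all GPUs on the node

@ray.remote(num_gpus=1)  # grant each task exactly one GPU
def train_one_model(model_name: str) -> str:
    # Ray sets CUDA_VISIBLE_DEVICES for this worker, so "cuda:0" below
    # maps to whichever physical GPU Ray assigned to the task.
    device = torch.device("cuda:0")
    print(f"{model_name} -> physical GPU(s) {ray.get_gpu_ids()}")
    # ... build the model, move it to `device`, run training ...
    return model_name

# With four GPUs, up to four of these single-GPU jobs run in parallel.
jobs = [train_one_model.remote(m) for m in ["DLinear", "PatchTST", "FEDformer", "TimesNet"]]
print(ray.get(jobs))
```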
Thank you for your response. To clarify my original issue further: I still have some points of confusion, but I believe that supporting multi-GPU training is of great significance for TFB. If possible, I am more than willing to assist in implementing this additional functionality. Thank you for your attention.
I have made some modifications to enable multi-GPU training with the sequential backend. The changes have been tested on a server equipped with four GPUs and are working smoothly. If possible, I can run more detailed tests and then contribute this functionality to TFB.
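To give a sense of the direction, the core of the change is roughly the sketch below: a DataParallel-based wrapper around the model that the sequential backend trains. The helper name and where it hooks into TFB's model adapter are simplified placeholders rather than the exact patch:

```python
import torch
import torch.nn as nn

def wrap_for_multi_gpu(model: nn.Module) -> nn.Module:
    """Replicate the model across all visible GPUs when more than one is available.

    This is the DataParallel variant; a DistributedDataParallel setup would
    additionally need a process group and a multi-process launcher.
    """
    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # scatters each batch across GPUs
    return model.to("cuda" if torch.cuda.is_available() else "cpu")

# Usage inside the sequential training loop (illustrative):
#   model = wrap_for_multi_gpu(build_model(config))
#   output = model(batch_x.cuda())  # DataParallel splits the batch, gathers outputs
```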
Thank you for your interest in contributing to our project!
Once you've addressed these points, feel free to submit your pull request. We look forward to reviewing your contribution! Thank you for your collaboration and support.
TFB is truly one of the best time series benchmarks I have ever had the pleasure of using. However, I have encountered an issue when attempting to train models using multiple GPUs.
As you may be aware, the Ray backend is employed for parallel processing. When I train a model with the Ray backend on a Linux server equipped with four GPUs, only a single GPU is actually utilized. Moreover, the GPU that gets used changes between experiments, sometimes cuda:0 and at other times cuda:1. Through extensive debugging, I am certain that the Ray backend successfully detects all the GPUs on the server; nevertheless, only one of them is ever used. It appears that multi-GPU parallel training with PyTorch is not actually happening, or perhaps I have overlooked a relevant part.
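In case it helps reproduce the symptom, a minimal check along these lines, using plain Ray and PyTorch outside of TFB's own scripts, shows what the cluster reports and what each single-GPU task actually sees:

```python
import os
import ray
import torch

ray.init()
# 4.0 on a four-GPU node if Ray detects all devices
print("GPUs Ray has registered:", ray.cluster_resources().get("GPU"))

@ray.remote(num_gpus=1)
def report():
    # What a single Ray worker actually sees
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch.cuda.device_count()": torch.cuda.device_count(),
        "ray.get_gpu_ids()": ray.get_gpu_ids(),
    }

print(ray.get(report.remote()))  # shows one visible GPU per task
```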
I would greatly appreciate your assistance in addressing this issue. Thank you very much!
Best regards.