
distributed training across multiple machines #74

Open
zzk88862 opened this issue Feb 6, 2025 · 1 comment
zzk88862 commented Feb 6, 2025

Hi, excellent work!
I want to run large-scale data experiments to evaluate i2v quality. How can I set up distributed training across multiple machines?

Thanks

Sarania commented Feb 6, 2025

I guess if you wanted to do that with musubi, the best way would be to configure accelerate appropriately and then launch with a custom accelerate config. You can be guided through this setup with "accelerate config --config_file distributed.yaml", which will ask you questions about your cluster. Then, when launching musubi, change your command to "accelerate launch --config_file distributed.yaml hv_train_network.py etc...". Note, however, that under Features the musubi readme.md says "Multi-GPU support not implemented", so ymmv.
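For reference, here is a minimal sketch of what a multi-node config might look like, assuming two machines with two GPUs each; the field values are illustrative assumptions, and the file "accelerate config" generates interactively for your actual cluster is what you should use:

```yaml
# distributed.yaml -- illustrative sketch only, assuming 2 nodes x 2 GPUs.
# Run "accelerate config --config_file distributed.yaml" to generate the real file.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2                 # total number of nodes
num_processes: 4                # total GPUs across all nodes
machine_rank: 0                 # 0 on the main node, 1 on the second node, etc.
main_process_ip: 192.168.1.10   # assumed address of the main node, reachable from all nodes
main_process_port: 29500
mixed_precision: bf16
```

You would then run the same launch command on every node (with machine_rank set per node in that node's copy of the config):

```bash
accelerate launch --config_file distributed.yaml hv_train_network.py <your usual training args>
```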

Otherwise, take a look at https://github.com/tdrussell/diffusion-pipe, which is designed around distributed training, whereas musubi is aimed more at overall memory efficiency on a single device.
