
distributed training across multiple machines #74

Open
zzk88862 opened this issue Feb 6, 2025 · 1 comment
zzk88862 commented Feb 6, 2025

Hi, excellent work!
I want to run large-scale data experiments to evaluate i2v quality. How can I set up distributed training across multiple machines?

Thanks

Sarania commented Feb 6, 2025

I guess if you wanted to do that with musubi, the best way would be to configure accelerate appropriately and then launch with a custom accelerate config. You can be guided through this setup with "accelerate config --config_file distributed.yaml", which will ask you questions about your cluster. Then, when launching musubi, change your command to "accelerate launch --config_file distributed.yaml hv_train_network.py etc...". Note, however, that under Features the musubi readme.md says "Multi-GPU support not implemented", so ymmv.
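For reference, here is a minimal sketch of what a multi-node config might look like, assuming two machines with two GPUs each; the field values are illustrative assumptions, and the file "accelerate config" generates interactively for your actual cluster is what you should use:

```yaml
# distributed.yaml -- illustrative sketch only, assuming 2 nodes x 2 GPUs.
# Run "accelerate config --config_file distributed.yaml" to generate the real file.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2                 # total number of nodes
num_processes: 4                # total GPUs across all nodes
machine_rank: 0                 # 0 on the main node, 1 on the second node, etc.
main_process_ip: 192.168.1.10   # assumed address of the main node, reachable from all nodes
main_process_port: 29500
mixed_precision: bf16
```

You would then run the same launch command on every node (with machine_rank set per node in that node's copy of the config):

```bash
accelerate launch --config_file distributed.yaml hv_train_network.py <your usual training args>
```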

Otherwise, take a look at https://github.com/tdrussell/diffusion-pipe, which is designed around distributed training, whereas musubi is aimed more at overall memory efficiency on a single device.
