
Cannot offload / swap blocks. RuntimeError: "fill_cpu" not implemented for 'Float8_e4m3fn' #90

Open
jnz86 opened this issue Feb 18, 2025 · 8 comments


@jnz86

jnz86 commented Feb 18, 2025

I am trying to train photos on a 12GB card in Linux. I was getting OOM, so I tried to enable blocks to swap at 36. Regardless of what value I use, I get a traceback that includes this runtime error in the middle.

I don't have a lot of VRAM, so I thought I would try the FP8 model instead of the 25GB FP16. Got this error, swapped it out for the full model, same error.

This is my script to start it (shell; ignore the backslashes). I copied it from a photo (traveling today), so there might be a typo somewhere, but the gist should be there. I suppose I could have tried removing --fp8_base. I remember trying --fp16_base, which isn't a thing apparently.

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
  --dit models/ckpts/mp_rank_00_model_states.pt \
  --dataset_config input/config.toml --xformers --mixed_precision bf16 --fp8_base \
  --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing \
  --max_data_loader_n_workers 2 --persistent_data_loader_workers \
  --network_module networks.lora --network_dim 32 \
  --timestep_sampling shift --discrete_flow_shift 7.0 \
  --max_train_epochs 16 --save_every_n_epochs 5 --seed 42 \
  --output_dir output/ --output_name output_lora --fp8_llm --blocks_to_swap 32
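
For reference, the dtype in the traceback is PyTorch's float8_e4m3fn. Outside of training, the missing CPU kernel can be reproduced with a one-liner like this (a sketch; whether it actually raises depends on your PyTorch version, since float8 CPU kernel coverage has been expanding):

python -c "import torch; t = torch.empty(4, dtype=torch.float8_e4m3fn); t.fill_(0.0)"
# on affected versions this raises:
# RuntimeError: "fill_cpu" not implemented for 'Float8_e4m3fn'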
  1. Any idea what the issue is with blocks to swap and "fill_cpu" not implemented?
  2. Outside of my issue, are these OK settings to get started?

Generally,
3. Should I pre-crop and resize the normal-resolution photos in my training data?
4. The training seems to handle AVIF and WebP images, considering it didn't error out on me. Should I manually convert these to JPG to be safe?

Really great work, thanks a ton!

@Sylsatra

Firstly, you don't need --fp8_llm. Secondly, I experienced OOM when using the --blocks_to_swap feature, so I recommend not using blocks to swap for now. And since you are using xformers, use split_attn too ("--xformers --split_attn"); on Linux you should use Triton and SageAttention instead, by the way. I trained some LoRAs using only 8GB of VRAM, so 12GB is possible!
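
A sketch of what that advice changes, using the flag spellings quoted above (check the trainer's --help for the exact SageAttention flag, since I'm not certain of it):

# current attention flag:   --xformers
# suggested instead:        --xformers --split_attn
# SageAttention on Linux needs its kernels installed first, e.g.:
pip install triton sageattention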

@Sylsatra

Also, first time seeing Float8_e4m3fn ☣

@jnz86
Author

jnz86 commented Feb 19, 2025

OK, a couple of things.

  1. I was getting OOM without blocks to swap. The FP8 model, no swapping, with --fp8_llm: OOM every time. Just 50 photos in the training set, all normal-resolution pictures, but musubi seems to bucket them automatically into a dozen or so size buckets for me. I just don't see how anyone with 12GB wouldn't get an immediate OOM like me. Is there more prep/sizing I need to do on my input images? With --blocks_to_swap, so far, I'm able to really dial in the memory. I manually walked the value to 25 or so to get 11.25GB usage, and it's chugging along.

  2. The issue here is that you can't run --fp8_base and --blocks_to_swap together, it seems! As soon as I took that argument away, it started training (of dubious quality). So I guess that is my solution? (Rough command at the end of this comment.)

  3. Agreed; I switched from xformers to SageAttention. On a 3080 I'm getting about 3.25 s/it or so.

  4. The quality sucks so far! I ran 78 epochs on 50 images, and it's absolute trash: grainy and useless. So... I'm hoping I just have something else messed up here, I guess.
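
For reference, roughly the command that's training now: --fp8_base dropped, --blocks_to_swap walked down to 25, same paths as my first post (a sketch; I later switched the attention backend to SageAttention, but I've left --xformers --split_attn here since I'm not sure of the exact flag spelling):

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
  --dit models/ckpts/mp_rank_00_model_states.pt \
  --dataset_config input/config.toml --xformers --split_attn --mixed_precision bf16 \
  --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing \
  --max_data_loader_n_workers 2 --persistent_data_loader_workers \
  --network_module networks.lora --network_dim 32 \
  --timestep_sampling shift --discrete_flow_shift 7.0 \
  --max_train_epochs 16 --save_every_n_epochs 5 --seed 42 \
  --output_dir output/ --output_name output_lora --blocks_to_swap 25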

@Sylsatra


If you are training for a character (not a style), try these to improve the quality:

  • Check config.toml to see if you have "enable_bucket = true".
  • --timestep_sampling sigmoid --discrete_flow_shift 1.0
  • --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']" (LoRA+ is optional)
  • Or do the old-school method: try different learning rates.

I hope this will help you!
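
Folded together, those suggestions amount to something like this (a sketch; the flag values are the ones listed above):

# check that bucketing is enabled in the dataset config:
grep enable_bucket input/config.toml    # expect: enable_bucket = true
# then in the training command, swap
#   --timestep_sampling shift --discrete_flow_shift 7.0
# for
#   --timestep_sampling sigmoid --discrete_flow_shift 1.0
# and optionally add LoRA+:
#   --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']"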

@Sylsatra

I managed to use blocks_to_swap with FP8. I'm training on a video dataset, so I don't know whether this will work for you, but the key is to clear the data cache and to lower the resolution for long videos (to about half or one third of the resolution of the short videos).
Improvement: from ~48 s/it to ~13 s/it!
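
If it helps, clearing and rebuilding the latent cache after a resolution change looks roughly like this (a sketch: the script name and --vae path follow musubi-tuner's README as I remember it, and where the cache files live depends on your dataset config, so verify both before deleting anything):

# find the previously cached latents (inspect before removing!):
find input/ -name '*.safetensors' -print
# after removing the stale cache, re-run caching with the updated config
# (the VAE path here is from my layout — adjust to yours):
python cache_latents.py --dataset_config input/config.toml --vae models/ckpts/vae/pytorch_model.pt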

@jnz86
Author

jnz86 commented Feb 20, 2025

I'm not using video at all. Not sure if that makes a difference here, but nothing I did with FP8 worked with --blocks_to_swap for me.

I might have to try different learning rates. The output is complete trash, and the loss per epoch never drops below 0.25, even running for FAR longer than all the guides show.

I'm not training a specific character; rather, I'm trying to train on pictures of an animal. So, all slightly different subjects, but generally the same.

IDK. I cropped all my photos to 544x960, so I have only that one resolution now. Trying again with the settings above.
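
For anyone else doing this prep, the crop/resize can be done in one pass with ImageMagick (a sketch; mogrify overwrites files in place, so run it on a copy — the input/photos/ path is just my layout):

# center-crop and resize every JPEG to exactly 544x960
# ("^" fills the 544x960 box; -extent then trims the overflow around the center)
mogrify -resize 544x960^ -gravity center -extent 544x960 input/photos/*.jpg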

@Sylsatra

0.25 is far too high. Can you share your training parameters?

@jnz86
Author

jnz86 commented Feb 21, 2025

I left this thread open in case there is something about the fill_cpu issue, but I made a separate one about what else I can try, because the output is so awful:

#92
