
Cannot offload / swap blocks. RuntimeError: "fill_cpu" not implemented for 'Float8_e4m3fn' #90

Open
jnz86 opened this issue Feb 18, 2025 · 8 comments


@jnz86

jnz86 commented Feb 18, 2025

I am trying to train photos on a 12GB card in Linux. I was getting OOM, so I tried to enable blocks to swap at 36. Regardless of what value I use, I get a traceback that includes this runtime error in the middle.

I don't have a lot of VRAM, so I thought I would try the FP8 model instead of the 25GB FP16. Got this error, swapped it out for the full model, same error.

This is my script to start it (shell; ignore the backslashes). I copied it from a photo (traveling today), so there might be a typo somewhere, but the gist should be there. I suppose I could have tried removing --fp8_base. I remember trying --fp16_base, which isn't a thing apparently.

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
  --dit models/ckpts/mp_rank_00_model_states.pt \
  --dataset_config input/config.toml --xformers --mixed_precision bf16 --fp8_base \
  --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing \
  --max_data_loader_n_workers 2 --persistent_data_loader_workers \
  --network_module networks.lora --network_dim 32 \
  --timestep_sampling shift --discrete_flow_shift 7.0 \
  --max_train_epochs 16 --save_every_n_epochs 5 --seed 42 \
  --output_dir output/ --output_name output_lora --fp8_llm --blocks_to_swap 32
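
For reference, the dtype in the traceback is PyTorch's float8_e4m3fn. Outside of training, the missing CPU kernel can be reproduced with a one-liner like this (a sketch; whether it actually raises depends on your PyTorch version, since float8 CPU kernel coverage has been expanding):

python -c "import torch; t = torch.empty(4, dtype=torch.float8_e4m3fn); t.fill_(0.0)"
# on affected versions this raises:
# RuntimeError: "fill_cpu" not implemented for 'Float8_e4m3fn'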
  1. Any idea what the issue is with blocks to swap and "fill_cpu" not implemented?
  2. Outside of my issue, are these OK settings to get started?

Generally,
3. Should I pre-crop and resize the normal-resolution photos in my training data?
4. The training seems to handle AVIF and WebP images, considering it didn't error out on me. Should I manually convert these to JPG to be safe?

Really great work, thanks a ton!

@Sylsatra

Firstly, you don't need --fp8_llm. Secondly, I experienced OOM when using the --blocks_to_swap feature, so I recommend not using blocks to swap for now. And since you are using xformers, use split_attn too ("--xformers --split_attn"); on Linux you should use Triton and SageAttention instead, by the way. I trained some LoRAs using only 8GB of VRAM, so 12GB is possible!
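
A sketch of what that advice changes, using the flag spellings quoted above (check the trainer's --help for the exact SageAttention flag, since I'm not certain of it):

# current attention flag:   --xformers
# suggested instead:        --xformers --split_attn
# SageAttention on Linux needs its kernels installed first, e.g.:
pip install triton sageattention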

@Sylsatra

Also, first time seeing Float8_e4m3fn ☣

@jnz86
Author

jnz86 commented Feb 19, 2025

OK, a couple of things.

  1. I was getting OOM without blocks to swap. The FP8 model, no swapping, with --fp8_llm: OOM every time. Just 50 photos in the training set, all normal-resolution pictures, but musubi seems to bucket them automatically into a dozen or so size buckets for me. I just don't see how anyone with 12GB wouldn't get an immediate OOM like me. Is there more prep/sizing I need to do on my input images? With --blocks_to_swap, so far, I'm able to really dial in the memory. I manually walked the value to 25 or so to get 11.25GB usage, and it's chugging along.

  2. The issue here is that you can't run --fp8_base and --blocks_to_swap together, it seems! As soon as I took that argument away, it started training (of dubious quality). So I guess that is my solution? (Rough command at the end of this comment.)

  3. Agreed; I switched from xformers to SageAttention. On a 3080 I'm getting about 3.25 s/it or so.

  4. The quality sucks so far! I ran 78 epochs on 50 images, and it's absolute trash: grainy and useless. So... I'm hoping I just have something else messed up here, I guess.
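
For reference, roughly the command that's training now: --fp8_base dropped, --blocks_to_swap walked down to 25, same paths as my first post (a sketch; I later switched the attention backend to SageAttention, but I've left --xformers --split_attn here since I'm not sure of the exact flag spelling):

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
  --dit models/ckpts/mp_rank_00_model_states.pt \
  --dataset_config input/config.toml --xformers --split_attn --mixed_precision bf16 \
  --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing \
  --max_data_loader_n_workers 2 --persistent_data_loader_workers \
  --network_module networks.lora --network_dim 32 \
  --timestep_sampling shift --discrete_flow_shift 7.0 \
  --max_train_epochs 16 --save_every_n_epochs 5 --seed 42 \
  --output_dir output/ --output_name output_lora --blocks_to_swap 25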

@Sylsatra


If you are training for a character (not a style), try these to improve the quality:

  • Check config.toml to see if you have "enable_bucket = true".
  • --timestep_sampling sigmoid --discrete_flow_shift 1.0
  • --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']" (LoRA+ is optional)
  • Or do the old-school method: try different learning rates.

I hope this will help you!
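
Folded together, those suggestions amount to something like this (a sketch; the flag values are the ones listed above):

# check that bucketing is enabled in the dataset config:
grep enable_bucket input/config.toml    # expect: enable_bucket = true
# then in the training command, swap
#   --timestep_sampling shift --discrete_flow_shift 7.0
# for
#   --timestep_sampling sigmoid --discrete_flow_shift 1.0
# and optionally add LoRA+:
#   --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']"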

@Sylsatra

I managed to use blocks_to_swap with FP8. I'm training on a video dataset, so I don't know whether this will work for you, but the key is to clear the data cache and to lower the resolution for long videos (to about half or one third of the resolution of the short videos).
Improvement: from ~48 s/it to ~13 s/it!
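
If it helps, clearing and rebuilding the latent cache after a resolution change looks roughly like this (a sketch: the script name and --vae path follow musubi-tuner's README as I remember it, and where the cache files live depends on your dataset config, so verify both before deleting anything):

# find the previously cached latents (inspect before removing!):
find input/ -name '*.safetensors' -print
# after removing the stale cache, re-run caching with the updated config
# (the VAE path here is from my layout — adjust to yours):
python cache_latents.py --dataset_config input/config.toml --vae models/ckpts/vae/pytorch_model.pt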

@jnz86
Author

jnz86 commented Feb 20, 2025

I'm not using video at all. Not sure if that makes a difference here, but nothing I did with FP8 worked with --blocks_to_swap for me.

I might have to try different learning rates. The output is complete trash, and the loss per epoch never drops below 0.25, even running for FAR longer than all the guides show.

I'm not training a specific character; rather, I'm trying to train on pictures of an animal. So, all slightly different subjects, but generally the same.

IDK. I cropped all my photos to 544x960, so I have only that one resolution now. Trying again with the settings above.
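
For anyone else doing this prep, the crop/resize can be done in one pass with ImageMagick (a sketch; mogrify overwrites files in place, so run it on a copy — the input/photos/ path is just my layout):

# center-crop and resize every JPEG to exactly 544x960
# ("^" fills the 544x960 box; -extent then trims the overflow around the center)
mogrify -resize 544x960^ -gravity center -extent 544x960 input/photos/*.jpg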

@Sylsatra

0.25 is far too high. Can you share your training parameters?

@jnz86
Author

jnz86 commented Feb 21, 2025

I left this thread open in case there is something about the fill_cpu issue, but I made a separate one about what else I can try, because the output is so awful:

#92
