What can I try if my loss isn't coming down? #92

Open
jnz86 opened this issue Feb 20, 2025 · 22 comments

@jnz86
jnz86 commented Feb 20, 2025

12GB card, sage-attention, trying to make an animal that isn't commonly represented in the training data more recognizable to the model, 50 photos of the animal in varied locations, manually captioned, cropped to 544x960...

I've read that people are getting character models done in a couple of hours. I'm at 27,000 steps, 500 epochs, a 2e-4 learning rate, buckets, using the FP16 model offloaded with swap blocks, FP16 LLM, etc.

I'm 24 hours in now, and the loss is still around 0.25. The output is trash.

I've tried everything I could find about flow and LoRA settings. Nothing seems to improve my training outcome.

My TensorBoard loss/epoch curve is basically an initial peak followed by a slightly downward-sloping line. Does this mean I'll need hundreds of hours to get down to something that will work?

I feel like something else is wrong. What can I check?
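
(As a rough sanity check on the numbers above, assuming 50 images, batch size 1, and num_repeats 1, the step count works out roughly as below, and the loss curves come straight from the TensorBoard logs the training command writes:)

# ~50 images x batch size 1      -> ~50 steps per epoch
# ~50 steps/epoch x 500 epochs   -> ~25,000 steps, in the same ballpark as the ~27,000 reported
# (bucketing and repeat settings shift the exact count)

# view the loss/epoch curves, assuming --logging_dir logs as in the command posted below
tensorboard --logdir logs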

@jnz86
Author

jnz86 commented Feb 21, 2025

My training command:

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
    --dit models/ckpts/mp_rank_00_model_states.pt  \
    --dataset_config input/config.toml --sage_attn --mixed_precision bf16  \
    --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing  \
    --max_data_loader_n_workers 2 --persistent_data_loader_workers  \
    --network_module networks.lora --network_dim 32  \
    --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']" \
    --timestep_sampling sigmoid --discrete_flow_shift 1.0 \
    --max_train_epochs 500 --save_every_n_epochs 4 --seed 42 \
    --output_dir output/ --output_name output_lora --blocks_to_swap 26 \
    --log_with tensorboard --logging_dir logs \
    --save_state

@Sylsatra

Wow, you are training without the fp8 base flag, this is great!
Since your loss has kind of plateaued at 0.25, I suspect your learning rate has already hit rock bottom and the model isn't learning much anymore. You should try a scheduler that can "restart".

@Sylsatra

Also 27k is wild

@jnz86
Author

jnz86 commented Feb 21, 2025

I avoided FP8 initially since I wanted to train using the full FP16 model, but that required swap blocks to fit it in VRAM (my other issue about that is still pending).

I'll also experiment with either revising or removing captions entirely to check for issues there.

27K steps is wild!! I had a 50-hour timer running and thought there was no way this was right! This should be a relatively simple training task, especially since the model already includes some animals similar to mine, just quite underrepresented. The software clearly works for others, so I'm puzzled about what's causing these problems.

Next steps...

I'll try FP8 again, and experiment with revising or removing captions. I found a dataset for a character that seems to be complete; I can try that to see if the problem is my images. I don't really understand how the learning rate or the network args work, so I can mess with those depending on how the previous ideas go.

@Sylsatra

Please try again with these tags:
--lr_scheduler cosine_with_restarts --optimizer_args betas=0.9,0.99 weight_decay=0.01 eps=1e-8

@jnz86
Author

jnz86 commented Feb 22, 2025

Nope. No action.

  • I tried your settings: almost zero change. Still about 0.25.
  • I found a guy on Civitai who made a clean LoRA of a person and posted his training data (thanks, stranger!). Using his data and his captions, still no action. It started lower, at about 0.1, but quickly climbed to 0.25 or so by the end; it's a much smaller set, of 14 pictures or so.

So, it's me. Or my hardware, or this program, or my dependencies, or something. IDK!! It doesn't appear to be my dataset.

Here is the result from TensorBoard for my past few tries. Lots of different settings, nearly identical results. The lowest, purple one is the suggested settings with the downloaded dataset. To be fair, that one only ran for 45 minutes before I ended it; it's a much smaller set and really ran through the epochs.

[Image: TensorBoard loss curves from the recent runs]

I really don't know. I guess I'll try going to pipe and trying that again?

@Sylsatra

Can you show me the training parameters for this one?

@jnz86
Author

jnz86 commented Feb 23, 2025

Training parameter?

Like the command line to run it? Same as I wrote above, except I'm low on disk space so I save every 10 epochs… because I know it doesn't work, so whatever, right?!

I had a thought… that maybe I was using the wrong VAE or CLIPs. That during my cache building I was using the wrong something and it was effectively jumbling the text all up. I followed the install guide again and I seem to be using clip_l and llava_llama3 fp16 correctly. I only had this thought because I tried multi-GPU workflows (to offload to system RAM) and needed a GGUF-compatible CLIP, otherwise the output was garbage. Could still be the case, I guess. I'll try getting the FP8 model and doing a generation from musubi to see if it's working at all.

@Sylsatra

You should not use these training parameters; they are not good, as I mentioned. Please at least add
--lr_scheduler cosine or
--lr_scheduler cosine_with_restarts
Then pair that with
--optimizer_args betas=0.9,0.99 weight_decay=0.05 eps=1e-8
Please add these, because your loss graph indicates that your LoRA is not learning anything!

@Sylsatra

Please clear the cache folder before caching the latents again; I don't know if automatic cache clearing is implemented for you.
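
(A minimal sketch of doing that by hand, assuming the dataset config's cache_directory is cache/ as elsewhere in this thread; the cached latent and text-encoder files live there:)

# clear previously cached latents / text encoder outputs
rm -rf cache/*
# then re-run cache_latents.py and cache_text_encoder_outputs.py before training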

@jnz86
Author

jnz86 commented Feb 23, 2025

Sorry, I wasn't clear. I added your suggestions to the command above, which produced the purple run above, and I've generally kept them for the couple of tests since.

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
    --dit models/ckpts/mp_rank_00_model_states.pt  \
    --dataset_config input/config.toml --sage_attn --mixed_precision bf16  \
    --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing  \
    --max_data_loader_n_workers 2 --persistent_data_loader_workers  \
    --network_module networks.lora --network_dim 32  \
    --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']" \
    --lr_scheduler cosine_with_restarts --optimizer_args betas=0.9,0.99 weight_decay=0.01 eps=1e-8 \
    --timestep_sampling sigmoid --discrete_flow_shift 1.0 \
    --max_train_epochs 500 --save_every_n_epochs 2 --seed 42 \
    --output_dir output/ --output_name output_lora --blocks_to_swap 26 \
    --log_with tensorboard --logging_dir logs

I agree! The LoRA clearly isn't learning anything. That's why I was hopeful that maybe I had messed up the text encoders or something. I manually clear the cache every time, so that isn't it. And I've been watching to make sure my caching steps are actually generating output.

EDIT: I'm going to triple-check my encoders and checkpoint and make sure the hashes are right. I'm also going to go back and get an FP8 checkpoint that I can use to generate from here; my guess is that if my models were messed up, I couldn't generate anything at all. So if I can generate, that means the models are good, and then it would have to be my settings somewhere?
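
(A quick way to do the hash check, assuming the file layout used in the commands in this thread; sha256sum is standard on Linux, and the output can be compared against the checksums published on each model's download page:)

sha256sum \
    models/ckpts/mp_rank_00_model_states.pt \
    models/vae/hunyuan_vae.pt \
    models/clip/llava_llama3_fp16.safetensors \
    models/clip/clip_l.safetensors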

@Sylsatra

Sylsatra commented Feb 24, 2025

Yeah, that seems correct if they are the OpenAI CLIP-L and Llava Llama 3 8B 1_1 models.

@Sylsatra

Sylsatra commented Feb 24, 2025

Since you are using cosine with restarts now, I recommend adding --lr_scheduler_num_cycles 10.
You should change the 10 to the number of epochs, or to half the number of epochs.

Additional flags to try: --max_grad_norm 0.3 --scale_weight_norms 1.0 --network_dropout 0.1

And if you still have some spare VRAM, try --gradient_accumulation_steps 2 (or 4, 6, even 8) for better results. IMO this option is not as good as a larger batch size, but it consumes way less VRAM!
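
(A hedged sketch of how those suggestions would be appended to the training command above; flag names as written in the comment, and --gradient_accumulation_steps assumed to be accepted as in the other kohya-style trainers:)

# appended to the accelerate launch ... hv_train_network.py command above
--lr_scheduler_num_cycles 10
# (change 10 to the number of epochs, or half of it)
--max_grad_norm 0.3 --scale_weight_norms 1.0 --network_dropout 0.1
--gradient_accumulation_steps 2
# (or 4 / 6 / 8 if VRAM allows; simulates a larger effective batch size at little VRAM cost)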

@jnz86
Author

jnz86 commented Feb 24, 2025

I am at a loss! @Sylsatra, your values did do better! But I'm still at about 0.2 and it never falls further; this software is just not working. For me only, I assume!

[Image: TensorBoard loss curves, latest run in teal]

The teal one is the latest run with all your settings. It did better and tried harder, but nope, it's just not falling much below 0.20.

I've tried everything I can think of. I wish I didn't need to run with --fp8_llm, but I can't fit a 16GB model on a 12GB card. I tried running the fp8_scaled llava-llama, but it complains about an incorrect key or sequence or something.
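
(For what it's worth, the 16GB figure is just the parameter count times bytes per weight, so fp8 roughly halves it; a rough sketch of the arithmetic, assuming the ~8B-parameter llava-llama text encoder:)

# fp16: ~8B params x 2 bytes/param ≈ 16 GB  -> will not fit on a 12 GB card alongside everything else
# fp8:  ~8B params x 1 byte/param  ≈  8 GB  -> fits, which is what --fp8_llm buys you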

Here is everything I had start to finish. I hope I'm just missing something dumb!!

config.toml

# general configurations
[general]
resolution = [544, 960]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
image_directory = "input/images"
cache_directory = "cache"
num_repeats = 1

Cache Latents

python cache_latents.py --dataset_config input/config.toml --vae models/vae/hunyuan_vae.pt --vae_chunk_size 32 --vae_tiling

# hunyuan_vae.pt renamed from pytorch_model.pt
# SHA256 95d1fc707c1421ccd88ea542838ab4c5d45a5babb48205bac9ce0985525f9818

Cache Text

python cache_text_encoder_outputs.py --dataset_config input/config.toml --text_encoder1 models/clip/llava_llama3_fp16.safetensors --fp8_llm --text_encoder2 models/clip/clip_l.safetensors --batch_size 16

# llava_llama3_fp16 
# SHA256 7e6fdee240ce6fa0537e2c55b473470c2a56e33cf9fcfa2d7bade452f364f390

# clip_l
# SHA256 660c6f5b1abae9dc498ac2d21e1347d2abdb0cf6c0c0c8576cd796491d9a6cdd

Train

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 hv_train_network.py \
    --dit models/ckpts/mp_rank_00_model_states.pt  \
    --dataset_config input/config.toml --sage_attn --mixed_precision bf16  \
    --optimizer_type adamw8bit --learning_rate 2e-4 --gradient_checkpointing  \
    --max_data_loader_n_workers 2 --persistent_data_loader_workers  \
    --max_train_epochs 60 --save_every_n_epochs 2 --seed 42 \
    --timestep_sampling sigmoid --discrete_flow_shift 1.0 \
    --network_module networks.lora --network_dim 32  \
    --network_args "loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']" \
    --lr_scheduler cosine_with_restarts --optimizer_args betas=0.9,0.99 weight_decay=0.01 eps=1e-8 \
    --lr_scheduler_num_cycles 30 \
    --max_grad_norm 0.3 --scale_weight_norms 1.0 --network_dropout 0.1 \
    --output_dir output/ --output_name output-lora --blocks_to_swap 26 \
    --log_with tensorboard --logging_dir logs \
    --save_state

# mp_rank_00_model_states.pt 25GB
# SHA256 00f4be6dcdc12c9f9a0a412a1000a4ace857c384f2834e6a23b78e3e5d7cac6a

All the really niche settings I have here do seem to have helped, but no one else seems to need them; it just seems to work for everyone else. I tried the fp8 model_states.pt, but oddly that gave me NaN loss values and didn't seem to do anything but go through the motions.

I'm lost.

@Sarania

Sarania commented Feb 24, 2025

Just a heads up: "--sage_attn" isn't supported for training, unless something I don't know about has changed (possible). At least it didn't work before:
#2

It's possible that's your problem, so maybe try any of the other attention implementations.
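
(For example, a minimal sketch assuming the standard musubi-tuner attention flags; check the README for which ones your install supports:)

# in the training command, replace
--sage_attn
# with one of the attention implementations supported for training, e.g.
--sdpa
# (PyTorch scaled dot product attention; --xformers or --flash_attn also work if installed)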

@jnz86
Author

jnz86 commented Feb 24, 2025


I mean... what is it there for, then!? Generation only? It seems like it would know that it's trying to train with an unsupported function and, you know, warn me about that.

Checks readme...

Use --sage_attn for SageAttention, but SageAttention is not yet supported for training and will not work correctly.

Son of... Ok guys, I mean, YES, the answer is there, but maybe a little louder!? :)

I'll take that out and try again. Thanks for the advice, @Sarania. I'm 100% sure this thread will help someone in the future if this was the problem.

@Sarania

Sarania commented Feb 24, 2025

Yeah, it's easy to get tripped up by it; it got me early on too. It's a shame Sage doesn't support the backward pass. You'll likely notice around 2x the per-step time compared to what you were getting with Sage; that's normal, since Sage was only doing half the work. FWIW, yes, it's there for inference (hv_generate_video.py is a fully featured HyVideo inference interface!), but I agree there should be some kind of warning or exception when trying to use Sage for training. @kohya-ss, what do you think about that? I see people getting tripped up by this a lot.

@Sylsatra

Oops, sorry for recommending Sage :(
Since you are training on rare, exotic animals, maybe the loss is destined to be high (and loss is not everything). However, if you really want to try to improve the quality, you should use --timestep_sampling shift with --discrete_flow_shift 7.0 instead of sigmoid (sigmoid is better for training humanoids).

And using a custom tag will help a lot, e.g. try jnz86animalname as a custom tag. (This helps a lot, but I recommend one tag per dataset, so you've got to change your config.toml.)

If you see the average loss fluctuating too much, lower lr_scheduler_num_cycles to half its value.

In the network args
"loraplus_lr_ratio=4" "exclude_patterns=[r'.single_blocks.']"
try increasing the LR ratio to 8 (or decreasing it to 2), and remove the exclude patterns to train all blocks.

The last resort is to increase the rank to 64.

Tbh I don't know why you have to use fp8_llm. I don't need it, and I only have 8GB of VRAM.

Finally, flash_attn is a little bit faster than xformers with about the same loss, but pair either with --split_attn.
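
(A hedged sketch of how those suggestions map onto the flags and captions used in this thread; jnz86animalname is just the example trigger tag suggested above:)

# timestep sampling: shift 7.0 instead of sigmoid
--timestep_sampling shift --discrete_flow_shift 7.0
# stronger LoRA+ ratio, no exclude patterns (train all blocks), optionally rank 64
--network_dim 64 --network_args "loraplus_lr_ratio=8"
# flash attention paired with split attention
--flash_attn --split_attn

# prepend the trigger tag to every caption file (captions sit next to the images per the config)
for f in input/images/*.txt; do sed -i '1s/^/jnz86animalname, /' "$f"; done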

@jnz86
Author

jnz86 commented Feb 26, 2025

Thanks @Sarania! Very helpful. I can't say my results are excellent yet, and that's going to take some more work, but it is 1/2 the speed per step now and that specific issue does seem to have been resolved.

@Sylsatra No problem! Now we both know!

Well, I don't know how you can get away without --fp8_llm; that llava-llama model is 16GB. No idea. I get OOM if I attempt anything without it. I am using Linux, but I doubt that matters here.

I'll retest with settings tweaks and --split_attn.

@Sarania

Sarania commented Feb 26, 2025

FWIW I need --fp8_llm to do samples while training (even though the TE outputs are cached upfront), but not while caching latents, and I have 16GB. Realistically it's not likely hurting much; I've worked with the llava models in my private LLM projects, including llava-llama-8b, and they quantize nicely. I wouldn't stress about it.

Of note: on Windows, by default, however much VRAM you have, you get that much system RAM to fall back on when it's exhausted (so e.g. if you have 8GB of VRAM you get 8GB more shared with system RAM). Falling back can be slow, especially if you use a lot of it, though it's very nice to have in a pinch. Maybe that's how @Sylsatra is doing it, because I'm curious too lol, and that's all I could think of. I'm on Linux (Endeavour) myself as well.

@Sylsatra

Perfect, I hope your training goes great!

@Sylsatra

Cool, I think you got it w(°o°)w, I do have a lot of RAM!
