What can I try if my loss isn't coming down? #92
Comments
My training command:
Wow, you are training without the fp8 base tag, this is great!
Also, 27k is wild!
I avoided FP8 initially since I wanted to train using the FP16 full model, but that required swap blocks to fit in RAM (my other issue about that is still pending). I'll also experiment with either revising or removing captions entirely to check for issues there. 27K steps is wild!! I had a 50 hour timer running and thought there was no way this was right! This is a relatively simple training task, especially since the model already includes some animals similar to mine, just quite underrepresented. The software clearly works for others, so I'm puzzled about what's causing these problems. Next steps: I'll try FP8 again, and experiment with revising or removing captions. I found a dataset for a character that seems to be complete, so I can try that to see if it's my images. I don't know how learning rate works, or the network args; I can mess with those depending on how the previous ideas go.
Please try again with these tags:
Nope. No action.
So, it's me. Or my hardware, or this program, or my dependencies, or something. I don't know! It doesn't appear to be my dataset. Here is the result from tensorboard for my past few tries: lots of different settings, results nearly identical. The lowest purple one is the settings suggested with the downloaded dataset. To be fair, that only ran for 45 minutes before I ended it; it's a much smaller set and really ran through the epochs. I really don't know. I guess I'll try going to pipe and trying that again?
Can you show me the training parameters for this one?
Training parameters? Like the command line to run it? Same as I wrote above, except I'm low on disk space so I save every 10 epochs… because I know it doesn't work, so whatever, right?! I had a thought… that maybe I was using the wrong VAE or CLIPs. That in my cache building I was using the wrong something and it was effectively jumbling the text all up. I followed the install guide again and seem to be using clip_l and llava_llama3 fp16 correctly. I only had this thought because I tried multi-GPU workflows (to offload to system RAM) and needed a GGUF-compatible clip, otherwise the output was garbage. Could still be the case, I guess. I'll try getting the FP8 model and doing a generation from musubi to see if it's working at all.
You should not use these training parameters, they are not good as I mentioned. Please at least add:
Please clear the cache folder before caching the latents again; I don't know if the auto clear cache is implemented for you.
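For reference, a minimal sketch of clearing the cache by hand, assuming the caching steps write .safetensors files alongside the dataset images; the naming pattern is an assumption, so list first and verify before deleting:

```bash
# List cached latent / text-encoder files, then delete them so the next caching
# run starts clean. The *_hv*.safetensors pattern is an assumption about the
# cache file naming; check the -print output before running the -delete line.
find /path/to/dataset -name '*_hv*.safetensors' -print
find /path/to/dataset -name '*_hv*.safetensors' -delete
```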
Sorry, I wasn't clear. I added your suggestions to the above and ran it for the purple line above, and generally on a couple of tests since.
I agree! The lora clearly isn't learning anything. That's why I was hopeful that maybe I messed up the text encoders, or something. I manually clear the cache every time, so that isn't it. And I've been watching to make sure my caching steps are generating. EDIT: I'm going to triple-check my encoders and checkpoint and make sure the hashes are right. Going to go back and get an FP8 checkpoint that I can use to generate from here; my guess is that if my models were messed up, I couldn't generate anything. So if I can generate, that means the models are good, and then it would have to be my settings somewhere?
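A minimal sketch of that hash check, assuming the models were downloaded as individual files; the paths and filenames below are placeholders rather than the exact ones used in this setup:

```bash
# Compare the output against the SHA-256 values listed on each model's download
# page. Paths and filenames here are placeholders, not the exact ones in use.
sha256sum \
  models/text_encoders/llava_llama3_fp16.safetensors \
  models/text_encoders/clip_l.safetensors \
  models/vae/pytorch_model.pt \
  models/transformer/mp_rank_00_model_states.pt
```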
Since you are using cosine with restarts now, I recommend adding --lr_scheduler_num_cycles 10. Additional: --max_grad_norm 0.3 --scale_weight_norms 1.0 --network_dropout 0.1. And if you still have some spare VRAM, try --gradient_accumulation_steps 2, or 4, 6, even 8 for better results. IMO this option is not as good as batch size but consumes way less VRAM!
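To make the suggestion concrete, here is a hedged sketch of how those options might be appended to a musubi-tuner training command; the script name, model/dataset paths, and baseline values are placeholder assumptions, so keep whatever you already pass and add just the new flags:

```bash
# A sketch only, not the exact command from this thread: the suggested options
# appended to a musubi-tuner LoRA training run. Paths and baseline values are
# placeholders standing in for whatever is already being used.
accelerate launch hv_train_network.py \
  --dit path/to/mp_rank_00_model_states.pt \
  --dataset_config path/to/config.toml \
  --sdpa \
  --network_module networks.lora --network_dim 32 \
  --optimizer_type adamw8bit --learning_rate 2e-4 \
  --lr_scheduler cosine_with_restarts --lr_scheduler_num_cycles 10 \
  --max_grad_norm 0.3 --scale_weight_norms 1.0 --network_dropout 0.1 \
  --gradient_accumulation_steps 2 \
  --output_dir output --output_name my_lora
```

Per the comment, --gradient_accumulation_steps can be raised to 4, 6, or 8 if VRAM allows.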
I am at a loss! @Sylsatra, your values did do better! But I'm still at 0.2 and it never falls; this software is just not working. For me only, I assume! The teal is the latest run with all your settings. It did better, and tried harder, but nope, it's just not falling much lower than 0.20. I've tried everything I can think of. I wish I didn't need to run --fp8_llm, but I can't fit a 16GB model on a 12GB card. I tried running the fp8_scaled llava-llama but it complains about an incorrect key or sequence or something. Here is everything I had, start to finish. I hope I'm just missing something dumb!!
config.toml
Cache Latents
Cache Text
Train
All the really niche settings I have here do seem to have helped, but no one else seems to need them. It just seems to work for everyone else. I tried the fp8 model_states.pt, but oddly that gave me NaN loss numbers and didn't seem to do anything but go through the motions. I'm lost.
Just a heads up: "--sage_attn" isn't supported for training unless something I don't know about has changed (possible). At least it didn't work before. It's possible that's your problem, so maybe try with any of the other attn implementations.
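In case it helps a later reader, a minimal sketch of what that swap could look like; the launch-script name is an assumption and the alternative flag pairs come from comments later in this thread:

```bash
# Swap --sage_attn for an attention backend that supports the backward pass;
# everything else in the training command stays as-is. train.sh is a
# placeholder name for wherever the command is saved.
sed -i 's/--sage_attn/--sdpa/' train.sh
# Other combinations mentioned later in the thread:
#   --xformers --split_attn
#   --flash_attn --split_attn
```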
I mean.... What is it there for then!? Generation only? Seems like it would know if it's trying to train with an unsupported function and, you know, warn me about that. Checks readme...
Son of... Ok guys, I mean, YES, the answer is there, but maybe a little louder!? :) I'll take that away and try again. Thanks for the advice @Sarania. I am 100% sure this thread will help someone in the future if this was the problem.
Yeah, it's easy to get tripped up by it; it got me early on too. It's a shame Sage doesn't support the backward pass. You'll likely notice around 2x per-step times compared to what you were getting with Sage; that's normal, since Sage was only doing half the work. FWIW, yes, it's there for inference (hv_generate_video.py is a fully featured HyVideo inference interface!), but I agree there should be some kind of warning or exception when trying to use Sage for training. @kohya-ss, what do you think about that? I see people getting tripped up by this a lot.
Oops, sorry for recommending Sage :( And the use of custom tags will help a lot, e.g. try jnz86animalname as a custom tag. (This will help a lot, but I recommend one tag per dataset, so you'll have to change your config.toml.) If you see the average loss fluctuate too much, lower lr_scheduler_num_cycles to half the value in the additional args. The last resort is to increase the rank to 64. Tbh I don't know why you have to use fp8_llm; I don't need it and I only have 8GB of VRAM. Finally, flash_attn is a little bit faster than xformers with about the same loss, but pair them with --split_attn.
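A small sketch of one way to apply the custom-tag suggestion, assuming one .txt caption file per image in the dataset folder (that layout and the path are assumptions); it prepends the example tag to every caption:

```bash
# Prepend the example trigger tag from this comment to each caption file.
# The folder path and one-caption-file-per-image layout are assumptions.
for f in /path/to/dataset/*.txt; do
  sed -i '1s/^/jnz86animalname, /' "$f"
done
```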
Thanks @Sarania! Very helpful. I can't say my results are excellent yet; that's going to take some more work, but it is 1/2 the speed now and it does seem that specific issue has been resolved. @Sylsatra No problem! Now we both know! Well, I don't know how you manage not to use --fp8_llm; that llava-llama model is 16GB. No idea, I get OOM if I attempt anything without it. I am using Linux, but I doubt that matters here. I'll retest with settings tweaks and --split_attn.
FWIW, I need --fp8_llm to do samples while training (even though the TE outputs are cached upfront), but not during latent caching, and I have 16GB. Realistically it's not likely hurting much; I've worked with the llava models in my private LLM projects, including llava-llama-8b, and they quantize nicely. I wouldn't stress about it. Of note: on Windows, by default, however much VRAM you have, you get that much system RAM to fall back on when it's exhausted (so e.g. if you have 8GB VRAM you get 8GB more shared with system RAM). That can be slow, especially if you use a lot, though it's very nice to have in a pinch. Maybe that's how @Sylsatra is doing it, because I'm curious too lol and that's all I could think of. I'm on Linux (Endeavour) myself as well.
Perfect, I hope your training will be great!
Cool, I think you got it w(°o°)w I do have a lot of RAM!
12GB, sage-attention, trying to make an animal not commonly seen in training data more recognizable to the model, 50 photos of animals in varied locations, manually captioned, cropped to 544x960...
I've read that people are getting character models done in a couple of hours. I'm at 27000 steps, 500 epochs, a 2e-4 learning rate, buckets, using the FP16 model offloaded with swap blocks, the FP16 llm, etc.
I'm 24 hours in now, and the loss is still around .25. The output is trash.
I've tried everything I could find about flow and lora settings. Nothing seems to improve my training outcome.
My tensorboard loss/epoch basically looks like a peak followed by a slightly downsloping line. Does this mean I'll need hundreds of hours to get down to something that will work?
I feel like something else is wrong. What can I check?