Properly implement cooldown step parsing and explicitly support WSD schedules #71

Open · wants to merge 3 commits into main

Conversation

@fizzAI commented Jan 1, 2025

Previously, cooldown steps were defined in the config but never used anywhere -- this PR fixes that and brings them up to par with the warmup hyperparameters.
Also, inspired by ModernBERT, it adds explicit support for WSD (warmup-stable-decay) schedules by manually constructing the scheduler kwargs to pass the stable and decay steps.
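
For context, this is roughly the LR shape a WSD schedule should trace. A minimal, self-contained sketch using PyTorch's `LambdaLR`; the step counts and `min_lr_ratio` below are made-up illustrations, not the values or code actually used in contrastors:

```python
# Illustrative warmup-stable-decay (WSD) schedule assembled from warmup,
# stable, and decay step counts. Hypothetical values, not contrastors code.
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, stable_steps, decay_steps = 100, 800, 100  # hypothetical split
min_lr_ratio = 0.0  # final LR as a fraction of the peak LR

def wsd_lambda(step: int) -> float:
    if step < warmup_steps:                    # linear warmup up to the peak LR
        return step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:     # hold at the peak LR
        return 1.0
    # linear cooldown from the peak LR down to min_lr_ratio
    progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
    return max(min_lr_ratio, 1.0 - (1.0 - min_lr_ratio) * min(progress, 1.0))

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = LambdaLR(optimizer, lr_lambda=wsd_lambda)
```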

@zanussbaum (Collaborator)

Hey, thanks for this! This looks great, but I haven't had time to verify it works as intended.

Do you have a wandb or other plot of what the learning rate looks like over time? Would love to verify that it's doing what's intended, and then I can merge :)
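
(For reference, a minimal sketch of logging the per-step LR to wandb so the schedule shape is easy to eyeball; the project name, scheduler, and step count are placeholders rather than the contrastors training loop:)

```python
# Sketch: drive any scheduler for a few hundred steps and log the LR to wandb.
# No real training is needed just to inspect the schedule shape.
import torch
import wandb

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
# LinearLR is just a stand-in; swap in the WSD scheduler under test
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=100)

wandb.init(project="lr-schedule-check")  # placeholder project name
for step in range(500):
    optimizer.step()   # a no-op without gradients, but keeps the step order valid
    scheduler.step()
    wandb.log({"train/lr": scheduler.get_last_lr()[0]}, step=step)
wandb.finish()
```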

@fizzAI (Author) commented Jan 2, 2025

Working on that now 🫡

@fizzAI (Author) commented Jan 2, 2025

... looks like this puppy has some fixing to do, that graph makes zero sense.

[attached images: two screenshots of the learning rate plot]

Either the slight jank I did to get it working on unpinned modern versions of transformers/torch/flash-attn busted it completely, or something's wrong with the learning rate schedule. I should probably not be trying this on a highly untested ModernBERT MLM config, tbf? Could also be what's breaking it.

@zanussbaum (Collaborator)

Oh I see, yeah, I wouldn't test the MLM because we usually converted the models to contrastors format to take advantage of all the flash attention workarounds (also, the MLM LR looks like that because I'm guessing you have gradient accumulation enabled).

I would test it on the finetuning script (contrastive_finetune.yaml) and see if you can get the right LR. It's faster to train and doesn't involve grad_cache or gradient accumulation.

@fizzAI (Author) commented Jan 2, 2025

Ohhh, it reports gradient accumulation steps as individual steps 🤦‍♀️ That explains why the graph is funky!
And thanks for the advice. I was trying out MLM initially because it was what I was most curious about, but I suppose that will probably require more debugging than this PR, lol. Will test out the finetuning scripts sometime later.

@zanussbaum (Collaborator)

Yeah, if you want to keep the same scripts, I'd turn off gradient accumulation and train for a smaller number of steps.
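
(To spell out the gradient-accumulation point above: if the LR is logged every micro-batch, the x-axis is stretched by the accumulation factor and the curve looks wrong; logging only when the optimizer actually steps gives the expected shape. A rough sketch with made-up numbers, not contrastors code:)

```python
# Sketch of the micro-batch vs. optimizer-step distinction: step the scheduler
# and log the LR only once per real optimizer update, not per micro-batch.
import torch

accum_steps = 8  # hypothetical gradient accumulation factor
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=50)

optimizer_step = 0
for micro_step in range(400):
    loss = model(torch.randn(4, 8)).mean() / accum_steps  # scale loss for accumulation
    loss.backward()
    if (micro_step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
        optimizer_step += 1
        # log against optimizer_step, not micro_step, or the plot is stretched 8x
        print(optimizer_step, scheduler.get_last_lr()[0])
```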
