Properly implement cooldown step parsing and explicitly support WSD schedules #71

Open · wants to merge 3 commits into main

Conversation

@fizzAI commented Jan 1, 2025

Previously, cooldown steps were defined in the config but never used anywhere -- this PR fixes that and brings them up to par with the warmup hyperparameters.
Also, inspired by ModernBERT, it adds explicit support for WSD (warmup-stable-decay) schedules by manually constructing the scheduler kwargs to pass the stable and decay steps.
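
For context, this is roughly the LR shape a WSD schedule should trace. A minimal, self-contained sketch using PyTorch's `LambdaLR`; the step counts and `min_lr_ratio` below are made-up illustrations, not the values or code actually used in contrastors:

```python
# Illustrative warmup-stable-decay (WSD) schedule assembled from warmup,
# stable, and decay step counts. Hypothetical values, not contrastors code.
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, stable_steps, decay_steps = 100, 800, 100  # hypothetical split
min_lr_ratio = 0.0  # final LR as a fraction of the peak LR

def wsd_lambda(step: int) -> float:
    if step < warmup_steps:                    # linear warmup up to the peak LR
        return step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:     # hold at the peak LR
        return 1.0
    # linear cooldown from the peak LR down to min_lr_ratio
    progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
    return max(min_lr_ratio, 1.0 - (1.0 - min_lr_ratio) * min(progress, 1.0))

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = LambdaLR(optimizer, lr_lambda=wsd_lambda)
```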

@zanussbaum (Collaborator)

Hey, thanks for this! This looks great, but I haven't had time to verify it works as intended.

Do you have a wandb or other plot of what the learning rate looks like over time? Would love to verify that it's doing what's intended, and then I can merge :)
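
(For reference, a minimal sketch of logging the per-step LR to wandb so the schedule shape is easy to eyeball; the project name, scheduler, and step count are placeholders rather than the contrastors training loop:)

```python
# Sketch: drive any scheduler for a few hundred steps and log the LR to wandb.
# No real training is needed just to inspect the schedule shape.
import torch
import wandb

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
# LinearLR is just a stand-in; swap in the WSD scheduler under test
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=100)

wandb.init(project="lr-schedule-check")  # placeholder project name
for step in range(500):
    optimizer.step()   # a no-op without gradients, but keeps the step order valid
    scheduler.step()
    wandb.log({"train/lr": scheduler.get_last_lr()[0]}, step=step)
wandb.finish()
```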

@fizzAI (Author) commented Jan 2, 2025

Working on that now 🫡

@fizzAI (Author) commented Jan 2, 2025

... looks like this puppy has some fixing to do, that graph makes zero sense.

[attached images: two screenshots of the learning rate plot]

Either the slight jank I did to get it working on unpinned modern versions of transformers/torch/flash-attn busted it completely, or something's wrong with the learning rate schedule. I should probably not be trying this on a highly untested ModernBERT MLM config, tbf? Could also be what's breaking it.

@zanussbaum (Collaborator)

Oh I see, yeah, I wouldn't test the MLM because we usually converted the models to contrastors format to take advantage of all the flash attention workarounds (also, the MLM LR looks like that because I'm guessing you have gradient accumulation enabled).

I would test it on the finetuning script (contrastive_finetune.yaml) and see if you can get the right LR. It's faster to train and doesn't involve grad_cache or gradient accumulation.

@fizzAI (Author) commented Jan 2, 2025

Ohhh, it reports gradient accumulation steps as individual steps 🤦‍♀️ That explains why the graph is funky!
And thanks for the advice. I was trying out MLM initially because it was what I was most curious about, but I suppose that will probably require more debugging than this PR, lol. Will test out the finetuning scripts sometime later.

@zanussbaum (Collaborator)

Yeah, if you want to keep the same scripts, I'd turn off gradient accumulation and train for a smaller number of steps.
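
(To spell out the gradient-accumulation point above: if the LR is logged every micro-batch, the x-axis is stretched by the accumulation factor and the curve looks wrong; logging only when the optimizer actually steps gives the expected shape. A rough sketch with made-up numbers, not contrastors code:)

```python
# Sketch of the micro-batch vs. optimizer-step distinction: step the scheduler
# and log the LR only once per real optimizer update, not per micro-batch.
import torch

accum_steps = 8  # hypothetical gradient accumulation factor
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=50)

optimizer_step = 0
for micro_step in range(400):
    loss = model(torch.randn(4, 8)).mean() / accum_steps  # scale loss for accumulation
    loss.backward()
    if (micro_step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
        optimizer_step += 1
        # log against optimizer_step, not micro_step, or the plot is stretched 8x
        print(optimizer_step, scheduler.get_last_lr()[0])
```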
