
Fix/best model checkpoint fix #35885

Open · wants to merge 3 commits into main
Conversation

seanswyi (Contributor)

What does this PR do?

Fixes #35609 (TL;DR best_model_checkpoint is being set when the checkpoint may not even exist)

  • Added a new attribute to TrainerState, best_global_step, which tracks the step at which evaluation produced a new best metric.
  • Moved the setting of best_model_checkpoint from _determine_best_metric to _save_checkpoint.
  • Inside _save_checkpoint, best_model_checkpoint is now set only if the corresponding checkpoint directory actually exists.

This was a bit trickier than expected, since the Trainer's saving and evaluation logic are not as tightly coupled as they are in the save_strategy == "best" scenario.


Who can review?

@muellerzr
@SunMarc
@tomaarsen

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Rather than setting best_model_checkpoint unconditionally, without checking whether the checkpoint directory exists (as before), the setting logic now lives inside _save_checkpoint and only runs when the directory exists.
Successfully merging this pull request may close these issues.

Trainer sets state.best_model_checkpoint even when it doesn't save there; leads to training crash