Generic weight averaging callback that supports EMA #20545

Open
wants to merge 3 commits into base: master

senarvi
Contributor

@senarvi senarvi commented Jan 14, 2025

A callback that updates an AveragedModel after every training step

What does this PR do?

This is similar to the existing StochasticWeightAveraging callback, but uses the AveragedModel class from PyTorch. Reduced code duplication means easier maintenance. Also, any averaging function can be used. Currently this callback does averaging on every step. We could make this callback support both SWA and EMA, or we could still have different callbacks ("StepwiseAveragingCallback" and "EpochwiseAveragingCallback"). The biggest questions:

  • Constructs the AveragedModel with use_buffers=True, so that an extra step is not needed for updating the batch normalization statistics. StochasticWeightAveraging performs an extra step in the end. Consequently the implementation is significantly more complex and it's difficult to make sure that it works in all cases. Should we add this as an option in this class too?

  • Updates the average model after every step. StochasticWeightAveraging updates the average model after every epoch, and I recall that the original paper updated it only at certain points (the learning rate minima). I guess it would be nice to be able to select whether the average model will be updated after every step, after every epoch, or after certain epochs. Then we would need only one callback and we could remove the StochasticWeightAveraging callback, but would it make this class too complex?
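To make the "any averaging function can be used" point concrete, here is a minimal pure-Python sketch of two averaging rules. It follows the three-argument convention (averaged value, current value, number averaged so far) that `torch.optim.swa_utils.AveragedModel` documents for `avg_fn`. The function names, the `decay` value, and the toy float trajectory are illustrative assumptions and not code from this PR.

```python
def ema_avg_fn(averaged, current, num_averaged, decay=0.999):
    """Exponential moving average: keep `decay` of the old value."""
    # num_averaged is unused here; it is kept to match the three-argument shape.
    return decay * averaged + (1.0 - decay) * current


def equal_avg_fn(averaged, current, num_averaged):
    """Running arithmetic mean over all values seen so far (SWA-style)."""
    return averaged + (current - averaged) / (num_averaged + 1)


# Toy trajectory standing in for a single weight observed at four updates.
values = [1.0, 2.0, 3.0, 4.0]

# The first value is copied as-is; later values are folded into the average.
mean = values[0]
ema = values[0]
for n, value in enumerate(values[1:], start=1):
    mean = equal_avg_fn(mean, value, n)
    ema = ema_avg_fn(ema, value, n, decay=0.5)

print(mean, ema)
```

With a small decay such as 0.5 the EMA tracks recent values much more closely than the arithmetic mean does, which is the practical difference between the two modes discussed above.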

Fixes #10914

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs) => Discussed in issue Add feature Exponential Moving Average (EMA) #10914
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary) => TODO
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors) => TODO

PR review

This pull request is still a work in progress and is open for discussion.
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20545.org.readthedocs.build/en/20545/

@github-actions github-actions bot added the docs (Documentation related) and pl (Generic label for PyTorch Lightning package) labels Jan 14, 2025
@lantiga
Collaborator

lantiga commented Jan 14, 2025

Hey @senarvi, this looks great!

I saw you already added support for saving and resuming, which is great. There are many scenarios there (save every n steps, time-based, every epoch, etc.), so let's make sure we cover them all (for inspiration, see the tests we added in #20379).

we could still have different callbacks ("StepwiseAveragingCallback" and "EpochwiseAveragingCallback")

No, I think it's better to have one callback with configurable averaging flags; that's more Lightning-esque.

Constructs the AveragedModel with use_buffers=True, so that an extra step is not needed for updating the batch normalization statistics. StochasticWeightAveraging performs an extra step in the end. Consequently the implementation is significantly more complex and it's difficult to make sure that it works in all cases. Should we add this as an option in this class too?

I think this is ok, but my doubt with forcing use_buffers to be true is what happens when a user has a module with buffers in it that are not meant to be averaged. I guess at that point they will probably be the same over time (e.g. the RoPE cache), but that's not really a guarantee.

Wdyt about this? I don't necessarily want to make the implementation more complex, so this is just for discussion.

Updates the average model after every step. StochasticWeightAveraging updates the average model after every epoch, and I recall that the original paper updated it only at certain points (the learning rate minima). I guess it would be nice to be able to select whether the average model will be updated after every step, after every epoch, or after certain epochs. Then we would need only one callback and we could remove the StochasticWeightAveraging callback, but would it make this class too complex?

It would be nice to make it configurable, and probably users will want to get to some minimum and then start averaging. The criteria to do so may be very bespoke, so maybe allowing the user to implement a custom hook to decide whether to start averaging or whether to average at a given step would be super handy. Otherwise I'm expecting users will train for some time, save a checkpoint, then reload with this callback added to the trainer and start averaging. Which is totally fine but it requires you to stop and resume.

Regarding removing the StochasticWeightAveraging callback, I don't necessarily see that happening. We have a pretty strong commitment to backward compatibility at this point, so keeping that in with a notice to just use this one will not hurt.

@senarvi
Contributor Author

senarvi commented Jan 15, 2025

I think this is ok, but my doubt with forcing use_buffers to be true is what happens when a user has a module with buffers in it that are not meant to be averaged. I guess at that point they will probably be the same over time (e.g. the RoPE cache), but that's not really a guarantee.

That's a good point. I don't know what would be a good solution.

Updates the average model after every step. StochasticWeightAveraging updates the average model after every epoch, and I recall that the original paper updated it only at certain points (the learning rate minima). I guess it would be nice to be able to select whether the average model will be updated after every step, after every epoch, or after certain epochs. Then we would need only one callback and we could remove the StochasticWeightAveraging callback, but would it make this class too complex?

It would be nice to make it configurable, and probably users will want to get to some minimum and then start averaging. The criteria to do so may be very bespoke, so maybe allowing the user to implement a custom hook to decide whether to start averaging or whether to average at a given step would be super handy. Otherwise I'm expecting users will train for some time, save a checkpoint, then reload with this callback added to the trainer and start averaging. Which is totally fine but it requires you to stop and resume.

That's an interesting idea. We could have the user pass a function update_on_step(global_step) or update_on_epoch(epoch) that returns a boolean. After each optimizer step and after each epoch we would call the function to check whether we should update the average model.

It seems that AveragedModel will copy the current model parameters when called the first time, and update the average on subsequent calls. This means that the first average is computed when update_on_step() or update_on_epoch() returns True for the second time. I don't see a better alternative.
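The copy-then-average behaviour described here can be illustrated with a toy stand-in. This is a sketch under stated assumptions: `ToyAverager` is a hypothetical class that holds one float instead of a module, the predicate threshold is arbitrary, and step numbering starting at 1 is an assumption rather than something this PR specifies.

```python
class ToyAverager:
    """Hypothetical stand-in for an averaged model holding a single float weight."""

    def __init__(self):
        self.average = None
        self.n_averaged = 0

    def update(self, current):
        if self.n_averaged == 0:
            # First call: nothing to average with yet, so just copy.
            self.average = current
        else:
            # Later calls: running arithmetic mean.
            self.average += (current - self.average) / (self.n_averaged + 1)
        self.n_averaged += 1


def update_on_step(global_step):
    # Illustrative predicate: start updating once step 2 has passed.
    return global_step > 2


averager = ToyAverager()
history = []
for step, weight in enumerate([10.0, 8.0, 6.0, 4.0, 2.0], start=1):
    if update_on_step(step):
        averager.update(weight)
    history.append(averager.average)

print(history)
```

The predicate first returns True at step 3, where the weight is only copied. The first value that is actually an average of two weights appears at step 4, the second time the predicate returns True.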

I checked how StochasticWeightAveraging does this and I think it doesn't work correctly. It only ever updates the average model parameters in on_train_epoch_start(), so the average is not updated after the last epoch. Just shows why I'd like to keep the logic as simple as possible.

@cyanic-selkie

Hi, I have a couple questions.

  1. You added the on_validation_epoch_start and on_validation_epoch_end hooks to swap the weights, but shouldn't the same happen for test?
  2. In my current workflow I have a separate script that does the model exporting to ONNX. It's short, and really the only Lightning specific thing is the MyLightningModule.load_from_checkpoint(...) method. Since the averaged weights are a part of the callback, I would have to instantiate the trainer for the weights to be loaded. And even then, I wouldn't have a function I could call to explicitly swap the weights (since _swap_weights is private and not really accessible). So, my question is, can we have some sort of an API, outside of the trainer, that can load the averaged weights instead of the regular weights? Perhaps adding some sort of a parameter to the load_from_checkpoint method?

@senarvi
Contributor Author

senarvi commented Jan 16, 2025

Hi @cyanic-selkie

During training (stage=fit), the actual LightningModule is what we update using the optimizer (I call it the current model) and an AveragedModel is maintained in the background (I call it the average model).

I assume that validation is only called during training. Before and after validation we swap the current model and the average model, so the average model will be validated.

When saving a checkpoint, we save the average model parameters in the state_dict. So if you later load the checkpoint without WeightAveraging callback and run a test or export to ONNX, you will be using the average parameters.

When training ends, we copy the average model parameters to the current model. So if you run a test or export to ONNX after training, you will be using the average parameters.

That's the idea at least. I'm not confident that I have thought about every possible corner case. It would be great if you could test that it works in your case.
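The swap around validation and the final copy can be sketched with plain dictionaries standing in for model parameters. This is only an illustration of the idea described above: `swap_weights` here is a hypothetical helper, not the private `_swap_weights` method from the PR, and the single parameter `w` is made up.

```python
def swap_weights(current, average):
    """Exchange parameter values between two same-keyed dicts, in place."""
    for name in current:
        current[name], average[name] = average[name], current[name]


current = {"w": 1.0}   # what the optimizer updates
average = {"w": 0.5}   # what is maintained in the background

swap_weights(current, average)      # before validation
validated_with = dict(current)      # validation sees the average parameters
swap_weights(current, average)      # after validation, training continues

after_validation = dict(current)

# When training ends, the average parameters are copied into the current model.
current.update(average)
print(validated_with, after_validation, current)
```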

@cyanic-selkie

@senarvi Ah! Thanks for the clarification, I should've checked the code out more carefully. I tried your branch out on a quantization aware training enabled model with ONNX export at the end and everything is working beautifully! I hope this gets merged quickly.

@senarvi senarvi force-pushed the generic-weight-averaging branch from efc77dc to 0010492 Compare January 23, 2025 16:07
@senarvi
Contributor Author

senarvi commented Jan 23, 2025

The user can now provide either the update_on_step or the update_on_epoch argument. (In theory also both.) Each should be a function that takes the step/epoch number and returns True if the average model should be updated at that point in time.

For example:

update_on_step = lambda x: x > 100

or

update_on_epoch = lambda x: x in (3, 5, 7)

Using update_on_epoch, SWA should be possible. I added one unit test for SWA.
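As a sketch of how an SWA-style schedule might be expressed with such a predicate, the factory below returns True once per cycle after a warm-up period. The names `make_cyclic_predicate`, `start_epoch` and `cycle_length` are hypothetical, and whether epochs are counted from 0 or 1 is an assumption that would need checking against the callback.

```python
def make_cyclic_predicate(start_epoch, cycle_length):
    """Return a predicate that is True once every `cycle_length` epochs from `start_epoch` on."""

    def update_on_epoch(epoch):
        return epoch >= start_epoch and (epoch - start_epoch) % cycle_length == 0

    return update_on_epoch


update_on_epoch = make_cyclic_predicate(start_epoch=3, cycle_length=2)

# Epochs at which the average model would be updated in a 10-epoch run.
selected = [epoch for epoch in range(10) if update_on_epoch(epoch)]
print(selected)
```

For the first eight epochs this selects the same epochs as the `x in (3, 5, 7)` lambda above, but it does not need the epoch list to be spelled out in advance.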

I tested EMA in an actual learning task and it gave an improvement, so I'm starting to be more confident that this works.

I think the biggest question that is still left is whether it's a problem that we force use_buffers=True. It would be nice if we could provide the option to instead call update_bn() after training and we wouldn't have to duplicate any of that code. That function takes a data loader and iterates through the data. I can imagine that passing the Trainer's data loader might not work in all cases. We could also leave calling this function to the user.

StochasticWeightAveraging increments the number of epochs in on_fit_start() and disables the backward pass during the extra epoch. I could also copy the code from that class, but there are some details that I don't understand, and I'm not that excited about copying code that I don't fully understand.

@tchaton I think you contributed the StochasticWeightAveraging callback, maybe you have some insight?

* A callback that updates a torch.optim.swa_utils.AveragedModel after specific steps or epochs.
* The user can provide a callback that defines after which steps or epochs the average model is updated.
@senarvi senarvi force-pushed the generic-weight-averaging branch from 5f34205 to c8d50bd Compare January 23, 2025 18:00
@github-actions github-actions bot added the fabric (lightning.fabric.Fabric) label Jan 23, 2025
@cyanic-selkie

Is there anything blocking this from being merged?

@senarvi senarvi changed the title Generic weight averaging callback that supports EMA [wip] Generic weight averaging callback that supports EMA Feb 2, 2025
@senarvi senarvi marked this pull request as ready for review February 2, 2025 21:21
@senarvi
Contributor Author

senarvi commented Feb 2, 2025

I marked this ready for review. There were no comments on whether it's a problem that we force use_buffers=True. Would it make sense to merge this now and perhaps introduce such an option later, based on the feedback that we receive?


codecov bot commented Feb 2, 2025

Codecov Report

Attention: Patch coverage is 95.23810% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (ea59e40) to head (51b9a06).

❗ There is a different number of reports uploaded between BASE (ea59e40) and HEAD (51b9a06). Click for more details.

HEAD has 318 uploads less than BASE
Flag               BASE (ea59e40)   HEAD (51b9a06)
cpu                96               24
lightning_fabric   13               0
pytest             50               0
python3.9          24               6
lightning          73               18
python3.10         12               3
python3.11         24               6
python3.12.7       36               9
gpu                2                0
pytorch2.1         18               9
pytest-full        48               24
pytorch2.2.2       6                3
pytorch_lightning  12               6
pytorch2.3         6                3
pytorch2.4.1       6                3
pytorch2.5.1       12               6
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20545     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      265      -2     
  Lines       23380    23409     +29     
=========================================
- Hits        20481    18446   -2035     
- Misses       2899     4963   +2064     

Collaborator

@lantiga lantiga left a comment

Solid contribution @senarvi! I added a few comments (most are quick to address, let me know what you can do here vs follow up PR), but overall looks great.

src/lightning/pytorch/callbacks/weight_averaging.py Outdated
checkpoint["state_dict"] = {
    name[7:]: value for name, value in average_model_state.items() if name.startswith("module.")
}
checkpoint["averaging_state"] = {
Collaborator

Suggested change
- checkpoint["averaging_state"] = {
+ checkpoint["averaged_state"] = {

I get that it might still be "averaging" : ), but it's in fact "averaged" up to the current iteration. We can call it the "average" model if that sounds better.

Contributor Author

@lantiga the name is a bit confusing, but it means the state of the averaging process, not the average model. This includes the state variables of the AveragedModel class, excluding the module (i.e. n_averaged). The average model is saved in state_dict, so whatever we'll do with the checkpoint, we'll use the average model. The current model state is saved in current_model_state, so that we can continue training with the WeightAveraging callback from the previous state. If you have a less confusing name for the "averaging state variables, excluding the averaged model parameters", I can change it.
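The checkpoint layout described in this reply can be sketched with plain dictionaries. This is an illustration under assumptions: `split_average_state` is a hypothetical helper, the parameter names and values are made up, and placing `current_model_state` at the top level of the checkpoint is an assumption about the layout rather than something shown in the diff excerpt.

```python
def split_average_state(average_model_state, prefix="module."):
    """Separate wrapped model parameters from the averaging bookkeeping."""
    # Entries under the prefix are the averaged parameters; strip the prefix
    # so they can be loaded directly into the plain model.
    model_params = {
        name[len(prefix):]: value
        for name, value in average_model_state.items()
        if name.startswith(prefix)
    }
    # Everything else (for example n_averaged) is state of the averaging process.
    averaging_state = {
        name: value
        for name, value in average_model_state.items()
        if not name.startswith(prefix)
    }
    return model_params, averaging_state


# Made-up state of an averaged model wrapping a one-layer module.
average_model_state = {
    "module.layer.weight": 0.25,
    "module.layer.bias": -0.1,
    "n_averaged": 7,
}
current_model_state = {"layer.weight": 0.3, "layer.bias": -0.2}

checkpoint = {}
checkpoint["state_dict"], checkpoint["averaging_state"] = split_average_state(average_model_state)
checkpoint["current_model_state"] = current_model_state
print(checkpoint)
```

Loading such a checkpoint without the callback would pick up `state_dict`, that is, the average parameters, which matches the behaviour described earlier in the thread.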


"""
if self._average_model is None:
    raise Exception("Trying to load a checkpoint, but no average model (outside fit). Don't know what to do.")
Collaborator

This is hard to understand for a user if they don't know the details of the callback.

Suggested change
- raise Exception("Trying to load a checkpoint, but no average model (outside fit). Don't know what to do.")
+ raise Exception("Trying to load a checkpoint using the WeightAveraging callback outside the `fit` stage. The WeightAveraging callback can only be used in the `fit` stage.")

I'm wondering: instead of raising we could just load the average model e.g. for predict. This will avoid forcing users to remove the callback from the Trainer.

Contributor Author

@senarvi senarvi Feb 4, 2025

@lantiga I guess I just wasn't sure in which situation this callback would be called outside fit, but yes, if the user calls Trainer.validate/test/predict(ckpt_path=...), I believe this will be called and the best thing to do would be to load the average model. The average model will be loaded if we don't do anything. Maybe just display a warning in that case.

I guess on_save_checkpoint() can also be called outside fit - if the user calls Trainer.save_checkpoint() after training. In that case we also don't have to do anything.

Contributor Author

@lantiga This is what I did. Please check if you think the messages are clear now.

assert trainer.lightning_module == model


def _train_and_resume(tmp_path: str, crash_on_epoch: int, use_ddp: bool = False) -> None:
Collaborator

This tests that we can crash and resume, but afaict it doesn't test whether the resulting averaging is equivalent. We can harden this in a subsequent PR, but it is important to know for sure that averaging works if I stop training and resume while averages are being taken, irrespective of where I stop and resume in the lifecycle.
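The equivalence property asked for here can be sketched without Lightning. The toy below is an assumption-laden illustration, not the PR's test: `run` is a hypothetical helper that folds a list of float weights into a running mean, and the saved state consists only of the average and the `n_averaged` counter. It checks that stopping at any point and resuming from the saved state gives the same result as an uninterrupted run.

```python
import math


def run(weights, averager_state=None):
    """Fold `weights` into a running mean, optionally resuming from saved state."""
    if averager_state is None:
        state = {"average": None, "n_averaged": 0}
    else:
        state = dict(averager_state)  # copy so the saved state is not mutated
    for weight in weights:
        if state["n_averaged"] == 0:
            state["average"] = weight  # first update only copies
        else:
            state["average"] += (weight - state["average"]) / (state["n_averaged"] + 1)
        state["n_averaged"] += 1
    return state


trajectory = [4.0, 2.0, 6.0, 8.0, 5.0]
uninterrupted = run(trajectory)

# Crash after every possible number of updates, including before the first one.
for crash_at in range(len(trajectory) + 1):
    saved = run(trajectory[:crash_at])
    resumed = run(trajectory[crash_at:], averager_state=saved)
    assert resumed["n_averaged"] == uninterrupted["n_averaged"]
    assert math.isclose(resumed["average"], uninterrupted["average"])

print(uninterrupted)
```

A real test would additionally have to restore the current model and optimizer state so that the resumed weight trajectory itself matches; the sketch takes the trajectory as given.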

Contributor Author

I'll still try to improve the test.

@lantiga
Collaborator

lantiga commented Feb 3, 2025

BTW: I think it's totally fine to merge this as is and open an issue to gather discussions about averaging buffers.

The other question I have (for the future) is related to fitting both models on GPU. It may make sense to give the ability to keep the AveragedModel on a different device (e.g. cpu) to keep the callback usable with larger models.

@senarvi
Contributor Author

senarvi commented Feb 4, 2025

The other question I have (for the future) is related to fitting both models on GPU. It may make sense to give the ability to keep the AveragedModel on a different device (e.g. cpu) to keep the callback usable with larger models.

There's a device argument already, and actually the default is cpu - as with StochasticWeightAveraging.
