
[Transformers future] Loss Computation for Compatibility with Transformers 4.48.3 #1794

Open · wants to merge 4 commits into base: transformers_future
Conversation

@yafshar (Contributor) commented Feb 23, 2025

What does this PR do?

  • Update the GaudiTrainer to better align with Transformers 4.48.3, with minor enhancements from 4.49.0

    • Remove get_batch_samples_transformers and use the superclass method get_batch_samples. num_items_in_batch is only constant in the special case of running the examples with --dataset_concatenation.
  • Update the Llama model loss computation (see the sketch after this list)

    • Simplified the loss computation by using a configurable loss function.
    • Replaced the inline loss computation logic with a call to self.loss_function.
    • Introduced ForCausalLMContextParallelLoss for context parallel loss computation.
    • Updated the __init__ method to set the loss function based on the parallel strategy.
    • Ensured compatibility with Transformers 4.48.3 by aligning with its structure and conventions.
  • Refactor the context parallel loss computation to better match upstream Transformers

    • Replaced _ContextParallelLoss with ContextParallelLossFunction for better clarity and consistency.
    • Updated fixed_cross_entropy to use ContextParallelLossFunction for gathering losses across context parallel groups.
    • Introduced ForCausalLMContextParallelLoss to handle loss computation for causal language modeling with context parallelism.
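To illustrate the intended structure, below is a minimal Python sketch of the fixed_cross_entropy / ForCausalLMContextParallelLoss pair described above. This is not the actual optimum-habana implementation: the signatures, the cp_group argument, and the plain all_reduce (standing in for ContextParallelLossFunction) are assumptions made for illustration only.

    # Hedged sketch only: the real code lives in optimum-habana's Llama modeling / loss utilities.
    import torch.distributed as dist
    import torch.nn.functional as F

    def fixed_cross_entropy(source, target, num_items_in_batch=None, ignore_index=-100, cp_group=None):
        # Sum of token losses on this rank; with context parallelism each rank
        # only sees its shard of the sequence.
        loss = F.cross_entropy(source, target, ignore_index=ignore_index, reduction="sum")
        if num_items_in_batch is None:
            num_items_in_batch = (target != ignore_index).sum()
            if cp_group is not None and dist.is_initialized():
                dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM, group=cp_group)
        if cp_group is not None and dist.is_initialized():
            # Gather (sum) the partial losses from all ranks of the context-parallel group.
            dist.all_reduce(loss, op=dist.ReduceOp.SUM, group=cp_group)
        return loss / num_items_in_batch

    def ForCausalLMContextParallelLoss(logits, labels, vocab_size, num_items_in_batch=None, ignore_index=-100, cp_group=None, **kwargs):
        # Shift so that tokens < n predict token n, as in the upstream ForCausalLMLoss;
        # the real implementation also has to account for how labels are sharded across ranks.
        logits = logits.float()
        shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
        shift_labels = labels[..., 1:].contiguous().view(-1).to(shift_logits.device)
        return fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, cp_group)

In the model's __init__, the idea is then to pick the loss based on the parallel strategy (e.g. self.loss_function = ForCausalLMContextParallelLoss when context parallelism is enabled, the default causal-LM loss otherwise) and to call self.loss_function(...) in forward() instead of computing the loss inline.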

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@yafshar yafshar marked this pull request as ready for review February 23, 2025 12:58
@yafshar (Contributor, Author) commented Feb 23, 2025

@regisss I removed get_batch_samples_transformers in GaudiTrainer and used the superclass method get_batch_samples. num_items_in_batch is only constant in the special case of running the examples with --dataset_concatenation. That could be an extra optimization later, but it is not the general case.
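For context, here is a simplified paraphrase (not the verbatim Transformers source) of what the upstream Trainer.get_batch_samples does: it pulls the next num_batches batches and counts the label tokens, so num_items_in_batch changes from one accumulation window to the next unless every example is packed to a fixed length.

    # Simplified paraphrase of transformers.Trainer.get_batch_samples (4.48.x); details may differ.
    def get_batch_samples(self, epoch_iterator, num_batches):
        batch_samples = []
        num_items_in_batch = None
        for _ in range(num_batches):
            try:
                batch_samples.append(next(epoch_iterator))
            except StopIteration:
                break
        # Count non-padding label tokens across the accumulation window; this value
        # varies per window unless examples are packed to a fixed length
        # (the --dataset_concatenation case mentioned above).
        if batch_samples and "labels" in batch_samples[0]:
            try:
                num_items_in_batch = sum((batch["labels"] != -100).sum() for batch in batch_samples)
            except (TypeError, AttributeError):
                pass
        return batch_samples, num_items_in_batch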

@regisss (Collaborator) left a comment:

I left a few comments. Can you also add in the description of the PR that it brings some changes from Transformers 4.49.0 too?

        for _ in range(total_updates):
            update_step += 1
            num_batches = args.gradient_accumulation_steps if update_step != (total_updates - 1) else remainder
-           batch_samples, num_items_in_batch = self.get_batch_samples_transformers(epoch_iterator, num_batches)
+           batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
regisss (Collaborator):

This will probably lead to an error in TRL examples as it is pinned to a quite old version: https://github.com/huggingface/optimum-habana/blob/main/examples/trl/requirements.txt#L1
I'll take a look to see if we can update it.

regisss (Collaborator):

We would need TRL v0.15 to avoid a clash between the two self.get_batch_samples methods of the trainers, but it requires Accelerate >= 0.34.
My suggestion is to keep get_batch_samples_transformers and add a TODO saying that it should be removed once the Accelerate dependency is upgraded (which should happen soon, see my comments above).

@@ -1351,7 +1365,7 @@ def _maybe_log_save_evaluate(self, tr_loss, _grad_norm, model, trial, epoch, ign
            self._globalstep_last_logged = self.state.global_step
            self.store_flos()

            self.log(logs, start_time=start_time)
regisss (Collaborator):

This also probably collides with former versions of TRL

yafshar (Contributor, Author):

I did not look at the TRL examples; I will check further.

regisss (Collaborator):

Same as #1794 (comment)

@@ -2615,31 +2636,3 @@ def _zero_model_grad(self, model):
            except TypeError:
                model.zero_grad()
                model._zero_grad_kwargs = {}

-    def get_batch_samples_transformers(self, epoch_iterator, num_batches):
regisss (Collaborator):

Have you checked that there is no throughput regression? That is the reason I added this method.

yafshar (Contributor, Author):

In some cases there is a minor regression, but we can optimize later. One simple optimization is to pre-compute num_items_in_batch before the loop, as you suggested in your previous comment. I can do some profiling and measure the impact. Can you point me to the examples where you see a noticeable regression?
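One possible shape for that optimization, as a hedged sketch (self._items_per_batch is a made-up attribute that would be set once, e.g. in train(), when --dataset_concatenation guarantees a constant number of label tokens per batch; this is not the actual optimum-habana code):

    def get_batch_samples(self, epoch_iterator, num_batches):
        batch_samples = []
        for _ in range(num_batches):
            try:
                batch_samples.append(next(epoch_iterator))
            except StopIteration:
                break
        if getattr(self, "_items_per_batch", None) is not None:
            # Constant count: skip the per-window `labels != -100` reduction on device.
            return batch_samples, self._items_per_batch * len(batch_samples)
        # Fall back to the generic counting path.
        num_items_in_batch = None
        if batch_samples and "labels" in batch_samples[0]:
            num_items_in_batch = sum((b["labels"] != -100).sum() for b in batch_samples)
        return batch_samples, num_items_in_batch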

regisss (Collaborator):

You can try:

RUN_SLOW=1 GAUDI2_CI=1 pytest tests/test_examples.py -v -s -k "test_run_clm_gpt2_single_card"

Note that you'll have to uncomment this line: https://github.com/yafshar/optimum-habana/blob/transformers_future/tests/utils.py#L38
Using GPT2 because the test is much faster than other tests with bigger models.

yafshar (Contributor, Author):

The regression for this model and test is more than a 50% difference. I am looking into it. Thanks for the hint.

@yafshar yafshar force-pushed the transformers_future branch from 09298b6 to 7f32580 Compare February 24, 2025 18:56
Comment on lines -998 to +1011
-           batch_samples, num_items_in_batch = self.get_batch_samples_transformers(epoch_iterator, num_batches)
+           batch_samples, num_items_in_batch = self.get_batch_samples(epoch_iterator, num_batches)
yafshar (Contributor, Author):

I haven't been able to reproduce the issue. However, I noticed that others have reported similar problems: huggingface/trl#2275

What about changing it to the following?

    batch_samples, num_items_in_batch = super(Trainer, self).get_batch_samples(epoch_iterator, num_batches)

regisss (Collaborator):

That would be a better way of managing it for sure! I guess it will ultimately depend on whether or not we need to override get_batch_samples to avoid throughput regressions (linked to the comment above).
