
TypeError: CustomTrainer.compute_loss() got an unexpected keyword argument 'num_items_in_batch' #36331

ruidazeng opened this issue Feb 21, 2025 · 7 comments · May be fixed by #36426
@ruidazeng

System Info

  • transformers version: 4.50.0.dev0
  • Platform: Linux-5.15.0-210.163.7.el8uek.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.16
  • Huggingface_hub version: 0.29.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.4.0
  • Accelerate config: not found
  • DeepSpeed version: 0.16.3
  • PyTorch version (GPU?): 2.6.0+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: NO
  • Using GPU in script?: YES
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

My code:

trainer = CustomTrainer(
    model=model,
    train_dataset=torch_format_dataset,
    eval_dataset=torch_format_dataset,
    args=training_args,
    data_collator=custom_data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Error Info:

[2025-02-20 19:14:49,033] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
num_devices: 1
max_steps: 1250
/opt/saturncloud/envs/tofu/lib/python3.10/site-packages/transformers/training_args.py:1609: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2025-02-20 19:14:50,775] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-02-20 19:14:50,775] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2025-02-20 19:14:50,903] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-02-20 19:14:51,687] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 341, num_elems = 1.42B
[2025-02-20 19:15:23,644] [WARNING] [engine.py:1244:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
Parameter Offload: Total persistent parameters: 544768 in 194 params
  0%|          | 0/1250 [00:00<?, ?it/s]
Error executing job with overrides: ['split=full', 'batch_size=4', 'gradient_accumulation_steps=4', 'model_family=phi', 'lr=2e-5']
Traceback (most recent call last):
  File "/home/jovyan/mu-benchmark/finetune.py", line 125, in main
    trainer.train()
  File "/opt/saturncloud/envs/tofu/lib/python3.10/site-packages/transformers/trainer.py", line 2243, in train
    return inner_training_loop(
  File "/opt/saturncloud/envs/tofu/lib/python3.10/site-packages/transformers/trainer.py", line 2554, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/opt/saturncloud/envs/tofu/lib/python3.10/site-packages/transformers/trainer.py", line 3704, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
TypeError: CustomTrainer.compute_loss() got an unexpected keyword argument 'num_items_in_batch'

Data Set: https://huggingface.co/datasets/locuslab/TOFU

Expected behavior

Why would there be a compute_loss() error? I never gave a num_items_in_batch argument.

ruidazeng added the bug label on Feb 21, 2025

SunMarc commented Feb 21, 2025

How is your compute_loss defined? We changed a couple of things with compute_loss recently in Trainer, and it now requires this new argument, which is indeed a bit breaking. The issue you have is that it is called during training here:
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
cc @muellerzr

@ruidazeng (Author)

def compute_loss(self, model, inputs, return_outputs=False):
    input_ids, labels, attention_mask = inputs
    # forward pass
    outputs = model(input_ids, labels=labels, attention_mask=attention_mask)
    # logits = outputs.get("logits")
    loss = outputs.loss
    # # compute custom loss (suppose one has 3 labels with different weights)
    # loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
    # loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
    return (loss, outputs) if return_outputs else loss
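
(For reference, a minimal sketch of the same override updated for the newer Trainer call; the num_items_in_batch and **kwargs parameters are an addition suggested by the traceback rather than something posted in the thread, and they keep the signature compatible with both older and newer transformers releases.)

from transformers import Trainer

class CustomTrainer(Trainer):
    # Accept the new keyword (and any future ones) so this override works with
    # Trainer versions that pass num_items_in_batch into compute_loss.
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None, **kwargs):
        input_ids, labels, attention_mask = inputs
        outputs = model(input_ids, labels=labels, attention_mask=attention_mask)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss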

bialatoheeb commented Feb 25, 2025

@muellerzr @SunMarc
I also encountered this issue with the latest version of transformers (4.49.0). Adding num_items_in_batch=None as an argument to my custom loss function fixed it.

@ruidazeng (Author)

How is your compute_loss defined?

@bialatoheeb

def compute_loss(
    self, model, inputs, num_items_in_batch=None, return_outputs=False
):
    labels = inputs.pop("labels")
    outputs = model(**inputs)
    logits = outputs[0]
    ......

SunMarc commented Feb 26, 2025

Could you try the PR above without your changes, @bialatoheeb @ruidazeng?

SunMarc commented Feb 26, 2025

This should work only if you specified compute_loss_func. It won't work if you override the compute_loss method.
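
(For illustration, a minimal sketch of the compute_loss_func route, assuming a recent transformers release where the hook receives (outputs, labels, num_items_in_batch) and labels are popped from the inputs before the forward pass; my_loss_func and the shift-by-one causal-LM loss are illustrative, not something confirmed in the thread.)

import torch.nn.functional as F
from transformers import Trainer

def my_loss_func(outputs, labels, num_items_in_batch=None):
    # Standard causal-LM shift; adjust for your task.
    logits = outputs.logits
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="sum" if num_items_in_batch is not None else "mean",
    )
    if num_items_in_batch is not None:
        # Normalize by the true token count across the accumulated batch.
        loss = loss / num_items_in_batch
    return loss

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torch_format_dataset,
    eval_dataset=torch_format_dataset,
    data_collator=custom_data_collator,
    compute_loss_func=my_loss_func,
)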
