resume_from_checkpoint failed when using PEFT LORA #35850

Open
HenryYueHY opened this issue Jan 23, 2025 · 1 comment

System Info

  • transformers version: 4.43.1
  • Platform: Linux-6.5.0-45-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.7
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.4.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 NVL

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am training my own model (a vision-language model) with LoRA using the Trainer. However, when I try to use resume_from_checkpoint, I encounter the following error:

Traceback (most recent call last):
  File "/dataT0/Free/hengyue/dataF0_save/new/LMM/training_Trainer.py", line 137, in <module>
    trainer.train(resume_from_checkpoint=True)
  File "/dataF0/Free/hengyue/miniconda3/envs/doraemon24/lib/python3.10/site-packages/transformers/trainer.py", line 2143, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/dataF0/Free/hengyue/miniconda3/envs/doraemon24/lib/python3.10/site-packages/transformers/trainer.py", line 2851, in _load_from_checkpoint
    if len(active_adapters) > 1:
TypeError: object of type 'method' has no len()

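For context, the failing check in Trainer._load_from_checkpoint (if len(active_adapters) > 1:) expects a list of adapter names, but here model.active_adapters resolves to a bound method, which is what the TypeError reports. The snippet below is only for illustration and is not part of my training script; it prints how the installed PEFT release defines active_adapters:

import inspect
import peft
from peft import PeftModel

print("peft:", peft.__version__)

# On some PEFT releases `active_adapters` is a property on PeftModel that
# returns a list of adapter names; on others it is not defined there at all
# and attribute access is forwarded to the wrapped base model, where it can
# be a plain method. The method case is what the Trainer check trips over.
attr = inspect.getattr_static(PeftModel, "active_adapters", None)
if attr is None:
    print("active_adapters is not defined on PeftModel itself")
else:
    print("PeftModel.active_adapters is a", type(attr).__name__)
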
My code is shown below:

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import Trainer, TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = MyModel()  # custom vision-language model (definition not shown)
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataset = Mydataset()  # user-defined training dataset (not shown)
val_dataset = Valdataset()  # user-defined validation dataset (not shown)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    num_train_epochs=11,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    report_to="wandb",
    dataloader_num_workers=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=custom_collate_fn,  # user-defined collator (not shown)
)

trainer.train(resume_from_checkpoint=True)
trainer.save_state()
trainer.save_model("./test_model")
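
For reference, resume_from_checkpoint=True just makes the Trainer pick up the most recent checkpoint-* directory under output_dir; pointing at a specific checkpoint explicitly goes through the same _load_from_checkpoint path and fails with the same TypeError (the directory name below is only an example):

# Equivalent explicit form; it hits the same active_adapters check.
trainer.train(resume_from_checkpoint="./results/checkpoint-500")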

Expected behavior

The model should continue training from the checkpoint instead of raising this TypeError.

HenryYueHY added the bug label on Jan 23, 2025

SergeiGoetheScriabin commented on Jan 23, 2025

Try updating transformers and accelerate to their latest versions using pip.
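
If it helps narrow things down, printing the installed versions before and after upgrading confirms the upgrade actually took effect in the environment the script runs in (nothing assumed here beyond the packages already in use):

import transformers, accelerate, peft

# The resume-from-checkpoint code path changed across releases, so the exact
# versions in the active environment are the first thing to check.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("peft:", peft.__version__)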
