resume_from_checkpoint failed when using PEFT LORA #35850

Open
HenryYueHY opened this issue Jan 23, 2025 · 1 comment

System Info

  • transformers version: 4.43.1
  • Platform: Linux-6.5.0-45-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.24.7
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.4.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 NVL

Who can help?

@muellerzr @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am training my own model (a vision-language model) with LoRA using the Trainer. However, when I try to use resume_from_checkpoint, I encounter the following error:

Traceback (most recent call last):
  File "/dataT0/Free/hengyue/dataF0_save/new/LMM/training_Trainer.py", line 137, in <module>
    trainer.train(resume_from_checkpoint=True)
  File "/dataF0/Free/hengyue/miniconda3/envs/doraemon24/lib/python3.10/site-packages/transformers/trainer.py", line 2143, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/dataF0/Free/hengyue/miniconda3/envs/doraemon24/lib/python3.10/site-packages/transformers/trainer.py", line 2851, in _load_from_checkpoint
    if len(active_adapters) > 1:
TypeError: object of type 'method' has no len()

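For context, the failing check in Trainer._load_from_checkpoint (if len(active_adapters) > 1:) expects a list of adapter names, but here model.active_adapters resolves to a bound method, which is what the TypeError reports. The snippet below is only for illustration and is not part of my training script; it prints how the installed PEFT release defines active_adapters:

import inspect
import peft
from peft import PeftModel

print("peft:", peft.__version__)

# On some PEFT releases `active_adapters` is a property on PeftModel that
# returns a list of adapter names; on others it is not defined there at all
# and attribute access is forwarded to the wrapped base model, where it can
# be a plain method. The method case is what the Trainer check trips over.
attr = inspect.getattr_static(PeftModel, "active_adapters", None)
if attr is None:
    print("active_adapters is not defined on PeftModel itself")
else:
    print("PeftModel.active_adapters is a", type(attr).__name__)
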
My code is shown below:

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import Trainer, TrainingArguments

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = MyModel()  # custom vision-language model (definition not shown)
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataset = Mydataset()  # user-defined training dataset (not shown)
val_dataset = Valdataset()  # user-defined validation dataset (not shown)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=1,
    num_train_epochs=11,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    report_to="wandb",
    dataloader_num_workers=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=custom_collate_fn,  # user-defined collator (not shown)
)

trainer.train(resume_from_checkpoint=True)
trainer.save_state()
trainer.save_model("./test_model")
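
For reference, resume_from_checkpoint=True just makes the Trainer pick up the most recent checkpoint-* directory under output_dir; pointing at a specific checkpoint explicitly goes through the same _load_from_checkpoint path and fails with the same TypeError (the directory name below is only an example):

# Equivalent explicit form; it hits the same active_adapters check.
trainer.train(resume_from_checkpoint="./results/checkpoint-500")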

Expected behavior

The model should continue training from the checkpoint instead of raising this TypeError.

HenryYueHY added the bug label on Jan 23, 2025

SergeiGoetheScriabin commented on Jan 23, 2025

Try updating transformers and accelerate to their latest versions using pip.
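
If it helps narrow things down, printing the installed versions before and after upgrading confirms the upgrade actually took effect in the environment the script runs in (nothing assumed here beyond the packages already in use):

import transformers, accelerate, peft

# The resume-from-checkpoint code path changed across releases, so the exact
# versions in the active environment are the first thing to check.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("peft:", peft.__version__)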
