Error when enabling gradient checkpointing #8

Open
lololololoki opened this issue Feb 6, 2025 · 0 comments

lololololoki commented Feb 6, 2025

Hi, thank you for your outstanding work!

When I try to reproduce the results with gradient checkpointing enabled, I consistently encounter the following error. However, everything runs fine when gradient checkpointing is disabled.
Have you encountered a similar issue before?

  • Error message:
```
[rank0]:     position_ids, rope_deltas = self.get_rope_index(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/miniconda3/envs/openr1_multimodal/lib/python3.12/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1456, in get_rope_index
[rank0]:     input_ids = input_ids[attention_mask[i] == 1]
[rank0]:                 ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: IndexError: The shape of the mask [588] at index 0 does not match the shape of the indexed tensor [587] at index 0
```
  • Command:
```bash
export WANDB_MODE=offline
export CUDA_LAUNCH_BLOCKING=1

ARNOLD_WORKER_GPU=8
ARNOLD_WORKER_NUM=1
ARNOLD_ID=0
METIS_WORKER_0_HOST=127.0.0.1
port_in_cmd=12345

torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" \
    --nnodes="${ARNOLD_WORKER_NUM}" \
    --node_rank="${ARNOLD_ID}" \
    --master_addr="${METIS_WORKER_0_HOST}" \
    --master_port="${port_in_cmd}" \
    src/open_r1/grpo.py \
    --output_dir checkpoints/Qwen2-VL-2B-GRPO-8k \
    --model_name_or_path /model_pools/Qwen2-VL-2B-Instruct \
    --dataset_name lmms-lab/multimodal-open-r1-8k-verified \
    --deepspeed scripts/zero3.json \
    --max_prompt_length 8192 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --max_pixels 2359296 \
    --save_total_limit 8 \
    --num_train_epochs 1 \
    --num_generations 2 \
    --run_name Qwen2-VL-2B-GRPO-8k
```
  • Environment:
    • Here are the relevant package versions; I suspect the issue may be version-related. Could you share the versions that worked for you?
```
accelerate               1.3.0
torch                    2.6.0
transformers             4.49.0.dev0
tokenizers               0.21.0
triton                   3.2.0
trl                      0.15.0.dev0
```
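
For reference, one workaround I am considering is switching gradient checkpointing to the non-reentrant implementation before the model is handed to the trainer. The sketch below is only an assumption on my side (the model path is the one from my command above), and I have not confirmed that it actually resolves the shape mismatch in `get_rope_index`:

```python
from transformers import Qwen2VLForConditionalGeneration

# Load the same checkpoint used in the training command above.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/model_pools/Qwen2-VL-2B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)

# use_reentrant=False selects the newer torch.utils.checkpoint code path,
# which tends to behave better with models that rebuild position ids from
# the attention mask during the forward pass. Whether this avoids the
# off-by-one mask/input_ids mismatch here is an assumption, not a verified fix.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```

If the script enables checkpointing through the trainer instead, the same setting could presumably be passed via the `gradient_checkpointing_kwargs` field of the training config, but I have not verified either route yet.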