Error when enabling gradient checkpointing #8

Open
lololololoki opened this issue Feb 6, 2025 · 0 comments

lololololoki commented Feb 6, 2025

Hi, thank you for your outstanding work!

When I try to reproduce the results with gradient checkpointing enabled, I consistently encounter the following error. However, everything runs fine when gradient checkpointing is disabled.
Have you encountered a similar issue before?

  • Error message:
```
[rank0]:     position_ids, rope_deltas = self.get_rope_index(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/miniconda3/envs/openr1_multimodal/lib/python3.12/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1456, in get_rope_index
[rank0]:     input_ids = input_ids[attention_mask[i] == 1]
[rank0]:                 ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: IndexError: The shape of the mask [588] at index 0 does not match the shape of the indexed tensor [587] at index 0
```
  • Command:
```bash
export WANDB_MODE=offline
export CUDA_LAUNCH_BLOCKING=1

ARNOLD_WORKER_GPU=8
ARNOLD_WORKER_NUM=1
ARNOLD_ID=0
METIS_WORKER_0_HOST=127.0.0.1
port_in_cmd=12345

torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" \
    --nnodes="${ARNOLD_WORKER_NUM}" \
    --node_rank="${ARNOLD_ID}" \
    --master_addr="${METIS_WORKER_0_HOST}" \
    --master_port="${port_in_cmd}" \
    src/open_r1/grpo.py \
    --output_dir checkpoints/Qwen2-VL-2B-GRPO-8k \
    --model_name_or_path /model_pools/Qwen2-VL-2B-Instruct \
    --dataset_name lmms-lab/multimodal-open-r1-8k-verified \
    --deepspeed scripts/zero3.json \
    --max_prompt_length 8192 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --max_pixels 2359296 \
    --save_total_limit 8 \
    --num_train_epochs 1 \
    --num_generations 2 \
    --run_name Qwen2-VL-2B-GRPO-8k
```
  • Environment:
    • Here are the relevant package versions; I suspect the issue may be version-related. Could you share the versions that worked for you?
```
accelerate               1.3.0
torch                    2.6.0
transformers             4.49.0.dev0
tokenizers               0.21.0
triton                   3.2.0
trl                      0.15.0.dev0
```
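
For reference, one workaround I am considering is switching gradient checkpointing to the non-reentrant implementation before the model is handed to the trainer. The sketch below is only an assumption on my side (the model path is the one from my command above), and I have not confirmed that it actually resolves the shape mismatch in `get_rope_index`:

```python
from transformers import Qwen2VLForConditionalGeneration

# Load the same checkpoint used in the training command above.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "/model_pools/Qwen2-VL-2B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)

# use_reentrant=False selects the newer torch.utils.checkpoint code path,
# which tends to behave better with models that rebuild position ids from
# the attention mask during the forward pass. Whether this avoids the
# off-by-one mask/input_ids mismatch here is an assumption, not a verified fix.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```

If the script enables checkpointing through the trainer instead, the same setting could presumably be passed via the `gradient_checkpointing_kwargs` field of the training config, but I have not verified either route yet.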