qwen2.5vl ft #36
Comments
Does the latest code still have the problem? I tested it last week and it was fine then.
I will try it now. Maybe my copy of the code is not the latest.
Let me know if it still has the issue.
I met the same problem, and I cloned the code on 2/10/2025. My data JSON looks like:
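(The JSON itself wasn't captured in this thread. Purely as an illustration, a video entry in the LLaVA-style format implied by the rest of the thread — the `conversations`/`from`/`value` structure with a `<video>` token — might look like the following; all ids and paths are hypothetical.)

```json
[
  {
    "id": "sample_0",
    "video": "clips/sample_0.mp4",
    "conversations": [
      { "from": "human", "value": "<video>\nWhat happens in this clip?" },
      { "from": "gpt", "value": "A person pours coffee into a mug." }
    ]
  }
]
```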
I changed `data.py` line 288 to:

```python
batch_second_per_grid_ts.extend([torch.tensor(ts) for ts in example["second_per_grid_ts"]])
```

but the error still points into transformers:

```
File "/opt/conda/envs/qwen2/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1627, in get_rope_index
```

This problem may be caused by the qwen2_5_vl code in transformers. Could you create a monkey-patch function fixing it? Thanks a lot!
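For context, the monkey patching being requested is just rebinding the method on the transformers class at import time. A mechanism-only sketch (the replacement body is elided since it depends on the actual bug, and the class name is an assumption about the transformers version installed at the time):

```python
from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl

def fixed_get_rope_index(self, *args, **kwargs):
    # The corrected position-id logic would go here; elided because the
    # real fix landed in the repo itself (see the next reply).
    raise NotImplementedError

# Rebinding the attribute makes every model instance pick up the fix.
modeling_qwen2_5_vl.Qwen2_5_VLForConditionalGeneration.get_rope_index = fixed_get_rope_index
```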
@Godheritage Sorry, I've just noticed I tested video only on Qwen2-VL, not 2.5. I'll take a look.
@Godheritage I've fixed the code. It works fine now.
@2U1 Thanks for your quick response and generous work. I have succeeded in fine-tuning Qwen2.5-VL with my videos now.
Thank you for the update. The video data training is running fine. However, a new issue has arisen: I'm trying to train using video and image data mixed together, and it fails.
@SFTJBD You need to use ZeRO-2 to fine-tune with video+image. It's a bit tricky to make it work with ZeRO-3, so I'm working on it.
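For reference, a minimal DeepSpeed ZeRO-2 config sketch (illustrative values only; the repo's own config files, if provided, should take precedence):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

The `"auto"` values rely on the Hugging Face Trainer integration filling them in from the training arguments.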
@SFTJBD I've updated the code to support ZeRO-3 with all kinds of mixed-modality data.
Thank you for the update. I am trying to use both video and image data within the same 'value' field in my data. Does the current code support this? For example:

```json
"conversations": [
  { "from": "human", "value": "<video>\n<image>\nMy question." }
]
```
@SFTJBD I wasn't considering that case. The code needs to be fixed to handle it.
OK, got it. So putting video and image data within multi-turn conversations will hit the same problem, right? (I've made simple modifications to the processor's input so it can handle both modalities simultaneously, but training still fails. Could this be a problem caused by DeepSpeed?)
@SFTJBD Yes, it's likely a DeepSpeed issue. To resolve it, you'll need to directly modify the forward function (in the monkey-patching file). Under DeepSpeed, all GPUs must build the same computation graph, meaning every GPU needs to pass through the visual module the same number of times per step. However, if video and image data are mixed in one batch, some GPUs might go through the visual pathway twice, while in batches with only a single image (or in text-only cases, which use a dummy input) they pass through it only once. Ensuring the visual processing happens twice in every batch should make it work correctly; a sketch of the idea follows below.
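A minimal sketch of that balancing idea, not the repo's actual patch: `visual_tower`, `make_dummy_input`, and `MAX_VISUAL_PASSES` are hypothetical names, and the shapes the dummy input needs depend on the model.

```python
import torch

MAX_VISUAL_PASSES = 2  # e.g. one pass for image patches, one for video patches

def run_visual_balanced(visual_tower, real_inputs, make_dummy_input):
    """Call the visual tower exactly MAX_VISUAL_PASSES times on every rank.

    real_inputs: list of (pixel_values, grid_thw) pairs actually present in
    this rank's batch; may be empty for text-only batches.
    make_dummy_input: returns a tiny (pixel_values, grid_thw) pair with
    shapes the tower accepts.
    """
    embeds = []
    zero = 0.0
    for i in range(MAX_VISUAL_PASSES):
        if i < len(real_inputs):
            pixel_values, grid_thw = real_inputs[i]
            embeds.append(visual_tower(pixel_values, grid_thw=grid_thw))
        else:
            # Dummy pass: same module call, same ZeRO-3 parameter-gathering
            # collectives, but zero-weighted so it cannot change the loss.
            pixel_values, grid_thw = make_dummy_input()
            zero = zero + visual_tower(pixel_values, grid_thw=grid_thw).mean() * 0.0
    # Add `zero` into the hidden states downstream so the dummy passes stay
    # connected to the graph and backward runs the same collectives everywhere.
    return embeds, zero
```

Zero-weighting keeps the dummy output's gradients at zero while still forcing identical forward and backward collectives on every rank.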
@SFTJBD I'll make a quick fix for it soon. Thanks for letting me know.
@SFTJBD I can't test it right now, but maybe this could resolve the problem for you.
Thanks for the update supporting Qwen2.5-VL! The video data preparation is as you described, but it's throwing an error I can't explain, in `/training/data.py`:

```python
pixel_values = torch.cat(batch_pixel_values, dim=0)
```

```
[rank0]: TypeError: expected Tensor as element 0 in argument 0, but got list
```
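The error means at least one element of `batch_pixel_values` is a Python list rather than a tensor. As a hedged workaround sketch (not necessarily the maintainer's eventual fix), flattening nested lists before concatenation avoids the TypeError:

```python
import torch

flat_pixel_values = []
for pv in batch_pixel_values:
    if isinstance(pv, list):
        flat_pixel_values.extend(pv)  # unpack per-example lists of tensors
    else:
        flat_pixel_values.append(pv)
pixel_values = torch.cat(flat_pixel_values, dim=0)
```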