<image> cannot be at the end. #43

Closed
ZhangGongjie opened this issue Feb 14, 2025 · 8 comments

Comments

@ZhangGongjie

<image> cannot be at the end. However, in Qwen2_5_VLProcessor, the image token is appended at the end. Will this cause sub-optimal performance?

@2U1
Owner

2U1 commented Feb 14, 2025

I don't think it matters where the <image> token is. Does an error occur when you put your image token at the end?
Also, what I remember is that when I tokenized the input, the tokens (<|image_pad|>) were located at the point where I had placed the image token.

@ZhangGongjie
Author

ZhangGongjie commented Feb 14, 2025

I think I figured out what's going on here.

When <image> is put at the end, it appears as \n<image> instead of <image>\n. Therefore, in such cases, the LLaVA image token (<image>) won't be replaced by Qwen image tokens.

It could be fixed by replacing both "\n<image>" and "<image>\n" with Qwen image tokens, for example as in the sketch below.
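
A minimal sketch of that fix (not the repo's actual code; the helper name and the QWEN_IMAGE_TOKENS value below are assumptions for illustration):

# Hypothetical helper: replace the LLaVA-style <image> placeholder with Qwen
# image tokens, whether it appears as "<image>\n", "\n<image>", or bare "<image>".
QWEN_IMAGE_TOKENS = "<|vision_start|><|image_pad|><|vision_end|>"  # assumed token string

def replace_image_token(text: str) -> str:
    for pattern in ("<image>\n", "\n<image>", "<image>"):
        text = text.replace(pattern, QWEN_IMAGE_TOKENS)
    return text

# Image placed at the end of the prompt:
print(replace_image_token("Please parse the given exam paper image in yaml format.\n<image>"))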

@2U1
Owner

2U1 commented Feb 14, 2025

Oh, I hadn't thought about putting the image at the end. In most cases the image was added at the beginning or in the middle.

@ZhangGongjie
Author

Yeah, it depends on whether your prompt or your image contains more information. I am doing OCR so I put the image at the end.

@ZhangGongjie
Author

It seems that in transformers' Qwen2.5-VL, the image token is also put at the end.

(Pdb) prompt

'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n\nPlease parse the given exam paper image in yaml format.<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n'

@2U1
Owner

2U1 commented Feb 14, 2025

Actually the location of the image token is decided by the user.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

If you are doing something like this, then the image token will come before the text.
The example is from the official repo of Qwen2.5-VL.
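
Conversely, putting the text entry before the image entry places the image token at the end of the user turn. A minimal sketch of that (same processor as above; it assumes apply_chat_template renders the content entries in order):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
        ],
    }
]

# Preparation for inference: the vision tokens
# (<|vision_start|><|image_pad|><|vision_end|>) now come after the text
# inside the user turn.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)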

@ZhangGongjie
Author

Indeed! I didn't realize that. Thanks for carefully looking into my question.

@2U1
Owner

2U1 commented Feb 18, 2025

I've updated the code to handle the <image> token at the end.

@2U1 2U1 closed this as completed Feb 18, 2025