About Siglip2 feature selection #36382
Comments
Hey @MonolithFoundation, thanks for the question. You can use the attention mask returned by the processor: the mask has shape (batch_size, max_num_patches) with 0s for padded patches (tokens).
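For illustration, here is a minimal sketch of how that mask can be used to drop padded tokens from the encoder output. It assumes the NaFlex checkpoint `google/siglip2-base-patch16-naflex` and that the processor returns `pixel_attention_mask` and `spatial_shapes` alongside `pixel_values`; the checkpoint name and exact field names are assumptions, so verify them against your transformers version.

```python
# Sketch only: selecting non-padded patch tokens from a Siglip2 (NaFlex) vision encoder.
# Assumed: the checkpoint name, and that the processor returns `pixel_attention_mask`
# and `spatial_shapes` (verify against your transformers version).
import torch
from PIL import Image
from transformers import AutoImageProcessor, Siglip2VisionModel

model_id = "google/siglip2-base-patch16-naflex"  # assumed NaFlex checkpoint
image_processor = AutoImageProcessor.from_pretrained(model_id, max_num_patches=1024)
model = Siglip2VisionModel.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path
inputs = image_processor(images=[image], return_tensors="pt")

with torch.no_grad():
    outputs = model(
        pixel_values=inputs["pixel_values"],                  # (1, max_num_patches, patch_dim)
        pixel_attention_mask=inputs["pixel_attention_mask"],  # (1, max_num_patches), 0 = padding
        spatial_shapes=inputs["spatial_shapes"],              # (1, 2) patch grid per image
    )

hidden = outputs.last_hidden_state            # (1, max_num_patches, hidden_dim)
keep = inputs["pixel_attention_mask"].bool()  # True where the patch is real
real_tokens = hidden[keep]                    # (num_real_patches, hidden_dim)
print(real_tokens.shape)
```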
Hi, I'd like to consult a little further: I want to use Siglip2 as an MLLM vision backbone with max_num_patches=1024, but I ran into some issues:
Hey @MonolithFoundation, Siglip2 vision position embeddings are resized to match the image size in patches. The image can also be non-square with any aspect ratio, e.g. 1024x256 or 256x1024, and that aspect ratio is preserved. See transformers/src/transformers/models/siglip2/modeling_siglip2.py, lines 171 to 227 at 41925e4.
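To make the aspect-ratio handling concrete, here is a small sketch (same assumed checkpoint and processor fields as above) showing that a 1024x256 and a 256x1024 image are resized to different patch grids while both fit within max_num_patches:

```python
# Sketch only: the NaFlex image processor picks a patch grid that preserves each
# image's aspect ratio, so wide and tall images get different `spatial_shapes`.
from PIL import Image
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "google/siglip2-base-patch16-naflex",  # assumed checkpoint
    max_num_patches=1024,
)

wide = Image.new("RGB", (1024, 256))  # 4:1 aspect ratio
tall = Image.new("RGB", (256, 1024))  # 1:4 aspect ratio

out = image_processor(images=[wide, tall], return_tensors="pt")
print(out["spatial_shapes"])                   # per-image patch grid (height, width), e.g. [[16, 64], [64, 16]]
print(out["pixel_values"].shape)               # (2, 1024, patch_size * patch_size * 3), padded to max_num_patches
print(out["pixel_attention_mask"].sum(dim=1))  # number of real (non-padded) patches per image
```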
Please use https://discuss.huggingface.co/ for questions; we are trying to keep issues only for bug reports and feature requests.
Feature request
Hi, glad to see Siglip2 support.
I'd like to ask: since the Siglip2 dynamic input (max_num_patches) involves padding, do the output tokens need to be selected?
For example, if we have max_num_patches=1024 but there is some padding due to aspect-ratio preservation, will some of the final 1024 output tokens correspond to padding? How can they be removed?
Motivation
For downstream tasks such as image understanding.
Your contribution
I would be happy to contribute a PR if needed.