About Siglip2 feature selection #36382

Open
MonolithFoundation opened this issue Feb 25, 2025 · 3 comments
@MonolithFoundation

Feature request

Hi, glad to see Siglip2 support.

I'd like to ask: since Siglip2's dynamic input (max_num_patches) involves padding, does the output need to be filtered?

For example, if we have max_num_patches=1024 but some patches are padding because the aspect ratio is preserved, will some of the final 1024 output tokens be padding tokens? If so, how do I remove them?

Motivation

For downstream tasks such as image understanding.

Your contribution

I would like to contribute a PR if it's needed.

MonolithFoundation added the Feature request label on Feb 25, 2025
@qubvel
Member

qubvel commented Feb 25, 2025

Hey @MonolithFoundation, thanks for the question. You can use the pixel_attention_mask produced by the processor, which indicates the padded patches.

The mask has shape (batch_size, max_num_patches), with 0s for padded patches (tokens).
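For illustration, here is a minimal sketch of dropping the padded tokens given the vision tower's per-token output and the processor's mask. The tensor names, shapes, and the "1 = real patch, 0 = padding" convention are assumptions based on the description above, not a fixed API:

import torch

# hidden_states: vision tower output, shape (batch_size, max_num_patches, embed_dim)
# pixel_attention_mask: from the Siglip2 processor, shape (batch_size, max_num_patches),
#                       assumed 1 for real patches and 0 for padding

def drop_padded_tokens(hidden_states: torch.Tensor, pixel_attention_mask: torch.Tensor) -> list[torch.Tensor]:
    """Return, per image, only the tokens that correspond to real (non-padded) patches."""
    keep = pixel_attention_mask.bool()
    # Images can have different numbers of real patches, so the result is a ragged list.
    return [tokens[mask] for tokens, mask in zip(hidden_states, keep)]

# Dummy example: first image uses 640 of 1024 patch slots, second uses all 1024.
hidden_states = torch.randn(2, 1024, 768)
pixel_attention_mask = torch.zeros(2, 1024, dtype=torch.long)
pixel_attention_mask[0, :640] = 1
pixel_attention_mask[1, :] = 1

unpadded = drop_padded_tokens(hidden_states, pixel_attention_mask)
print([t.shape for t in unpadded])  # [torch.Size([640, 768]), torch.Size([1024, 768])]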

qubvel added the Vision label and removed the Feature request label on Feb 25, 2025
@MonolithFoundation
Author

Hi, I'd like to ask a little further:

I want to use Siglip2 as an MLLM vision backbone with max_num_patches=1024, but I ran into some issues:

  • How can I shrink the number of tokens fed into the LLM? I tried something like the VLPatchMerger in Qwen2-VL; it learned something, but not very well.
  • If max_num_patches is set to a fixed number such as 1024, every image is effectively resized to that target size. How can this be made dynamic? (Theoretically it's the same as resizing to 512x512.)

@qubvel
Member

qubvel commented Feb 26, 2025

Hey @MonolithFoundation, Siglip2 vision position embeddings are resized to match the image size in "patches". The image can also be non-square with an arbitrary aspect ratio, e.g. 1024x256 or 256x1024, and that aspect ratio is preserved.

import torch
import torch.nn.functional as F

def resize_positional_embeddings(
    positional_embeddings: torch.Tensor,
    spatial_shapes: torch.LongTensor,
    max_length: int,
) -> torch.Tensor:
    """
    Resize positional embeddings to image-specific size and pad to a fixed size.

    Args:
        positional_embeddings (`torch.Tensor`):
            Position embeddings of shape (height, width, embed_dim)
        spatial_shapes (`torch.LongTensor`):
            Spatial shapes of shape (batch_size, 2) to resize the positional embeddings to
        max_length (`int`):
            Maximum length of the positional embeddings to pad resized positional embeddings to

    Returns:
        `torch.Tensor`: Embeddings of shape (batch_size, max_length, embed_dim)
    """
    batch_size = spatial_shapes.shape[0]
    embed_dim = positional_embeddings.shape[-1]
    source_dtype = positional_embeddings.dtype

    resulted_positional_embeddings = torch.empty(
        (batch_size, max_length, embed_dim),
        device=positional_embeddings.device,
        dtype=source_dtype,
    )

    # (height, width, embed_dim) -> (1, embed_dim, height, width) for interpolation
    positional_embeddings = positional_embeddings.permute(2, 0, 1).unsqueeze(0)

    # Upcast to float32 on CPU because antialias is not supported for bfloat16/float16 on CPU
    if positional_embeddings.device.type == "cpu":
        positional_embeddings = positional_embeddings.to(torch.float32)

    for i in range(batch_size):
        # (1, dim, height, width) -> (1, dim, target_height, target_width)
        height, width = spatial_shapes[i]
        resized_embeddings = F.interpolate(
            positional_embeddings,
            size=(height, width),
            mode="bilinear",
            align_corners=False,
            antialias=True,
        )

        # (1, dim, target_height, target_width) -> (target_height * target_width, dim)
        resized_embeddings = resized_embeddings.reshape(embed_dim, height * width).transpose(0, 1)

        # Cast to original dtype
        resized_embeddings = resized_embeddings.to(source_dtype)

        resulted_positional_embeddings[i, : height * width] = resized_embeddings
        resulted_positional_embeddings[i, height * width :] = resized_embeddings[0]

    return resulted_positional_embeddings
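To make the aspect-ratio point concrete, here is a rough back-of-the-envelope sketch of the patch grid you end up with under a fixed max_num_patches budget. This only illustrates the idea; it is not the exact resizing/rounding the Siglip2 image processor implements:

import math

def approx_patch_grid(img_height: int, img_width: int, patch_size: int = 16, max_num_patches: int = 1024):
    """
    Rough illustration (assumption, not the library's algorithm): scale the image so that
    (h / patch_size) * (w / patch_size) is close to max_num_patches while keeping h / w constant.
    """
    scale = math.sqrt(max_num_patches * patch_size**2 / (img_height * img_width))
    grid_h = max(1, round(img_height * scale / patch_size))
    grid_w = max(1, round(img_width * scale / patch_size))
    return grid_h, grid_w

print(approx_patch_grid(1024, 256))  # roughly (64, 16): 1024 patches, 4:1 aspect ratio kept
print(approx_patch_grid(256, 1024))  # roughly (16, 64): 1024 patches, 1:4 aspect ratio kept
print(approx_patch_grid(512, 512))   # roughly (32, 32): same budget as a 512x512 square image

So with max_num_patches=1024 the total patch budget is comparable to a 512x512 image, but the grid shape (and therefore spatial_shapes, and the resized position embeddings above) still follows each image's own aspect ratio.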

Please use https://discuss.huggingface.co/ for questions; we are trying to keep issues for bug reports and feature requests only.
