About Siglip2 feature selection #36382

Open
MonolithFoundation opened this issue Feb 25, 2025 · 3 comments
@MonolithFoundation

Feature request

Hi, glad to see Siglip2 support.

I'd like to ask: since Siglip2's dynamic input (max_num_patches) involves padding, does the output need to be filtered?

For example, if we have max_num_patches=1024 but some patches are padding because the aspect ratio is preserved, will some of the final 1024 output tokens be padding tokens? If so, how do I remove them?

Motivation

For downstream tasks such as image understanding.

Your contribution

I would like to contribute a PR if it's needed.

MonolithFoundation added the Feature request label on Feb 25, 2025
@qubvel
Member

qubvel commented Feb 25, 2025

Hey @MonolithFoundation, thanks for the question. You can use the pixel_attention_mask produced by the processor, which indicates the padded patches.

The mask has shape (batch_size, max_num_patches), with 0s for padded patches (tokens).
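For illustration, here is a minimal sketch of dropping the padded tokens given the vision tower's per-token output and the processor's mask. The tensor names, shapes, and the "1 = real patch, 0 = padding" convention are assumptions based on the description above, not a fixed API:

import torch

# hidden_states: vision tower output, shape (batch_size, max_num_patches, embed_dim)
# pixel_attention_mask: from the Siglip2 processor, shape (batch_size, max_num_patches),
#                       assumed 1 for real patches and 0 for padding

def drop_padded_tokens(hidden_states: torch.Tensor, pixel_attention_mask: torch.Tensor) -> list[torch.Tensor]:
    """Return, per image, only the tokens that correspond to real (non-padded) patches."""
    keep = pixel_attention_mask.bool()
    # Images can have different numbers of real patches, so the result is a ragged list.
    return [tokens[mask] for tokens, mask in zip(hidden_states, keep)]

# Dummy example: first image uses 640 of 1024 patch slots, second uses all 1024.
hidden_states = torch.randn(2, 1024, 768)
pixel_attention_mask = torch.zeros(2, 1024, dtype=torch.long)
pixel_attention_mask[0, :640] = 1
pixel_attention_mask[1, :] = 1

unpadded = drop_padded_tokens(hidden_states, pixel_attention_mask)
print([t.shape for t in unpadded])  # [torch.Size([640, 768]), torch.Size([1024, 768])]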

qubvel added the Vision label and removed the Feature request label on Feb 25, 2025
@MonolithFoundation
Author

Hi, I'd like to ask a little further:

I want to use Siglip2 as an MLLM vision backbone with max_num_patches=1024, but I ran into some issues:

  • How can I shrink the number of tokens fed into the LLM? I tried something like the VLPatchMerger in Qwen2-VL; it learned something, but not very well.
  • If max_num_patches is set to a fixed number such as 1024, every image is effectively resized to that target size. How can this be made dynamic? (Theoretically it's the same as resizing to 512x512.)

@qubvel
Member

qubvel commented Feb 26, 2025

Hey @MonolithFoundation, Siglip2 vision position embeddings are resized to match the image size in "patches". The image can also be non-square with an arbitrary aspect ratio, e.g. 1024x256 or 256x1024, and that aspect ratio is preserved.

import torch
import torch.nn.functional as F

def resize_positional_embeddings(
    positional_embeddings: torch.Tensor,
    spatial_shapes: torch.LongTensor,
    max_length: int,
) -> torch.Tensor:
    """
    Resize positional embeddings to image-specific size and pad to a fixed size.

    Args:
        positional_embeddings (`torch.Tensor`):
            Position embeddings of shape (height, width, embed_dim)
        spatial_shapes (`torch.LongTensor`):
            Spatial shapes of shape (batch_size, 2) to resize the positional embeddings to
        max_length (`int`):
            Maximum length of the positional embeddings to pad resized positional embeddings to

    Returns:
        `torch.Tensor`: Embeddings of shape (batch_size, max_length, embed_dim)
    """
    batch_size = spatial_shapes.shape[0]
    embed_dim = positional_embeddings.shape[-1]
    source_dtype = positional_embeddings.dtype

    resulted_positional_embeddings = torch.empty(
        (batch_size, max_length, embed_dim),
        device=positional_embeddings.device,
        dtype=source_dtype,
    )

    # (height, width, embed_dim) -> (1, embed_dim, height, width) for interpolation
    positional_embeddings = positional_embeddings.permute(2, 0, 1).unsqueeze(0)

    # Upcast to float32 on CPU because antialias is not supported for bfloat16/float16 on CPU
    if positional_embeddings.device.type == "cpu":
        positional_embeddings = positional_embeddings.to(torch.float32)

    for i in range(batch_size):
        # (1, dim, height, width) -> (1, dim, target_height, target_width)
        height, width = spatial_shapes[i]
        resized_embeddings = F.interpolate(
            positional_embeddings,
            size=(height, width),
            mode="bilinear",
            align_corners=False,
            antialias=True,
        )

        # (1, dim, target_height, target_width) -> (target_height * target_width, dim)
        resized_embeddings = resized_embeddings.reshape(embed_dim, height * width).transpose(0, 1)

        # Cast to original dtype
        resized_embeddings = resized_embeddings.to(source_dtype)

        resulted_positional_embeddings[i, : height * width] = resized_embeddings
        resulted_positional_embeddings[i, height * width :] = resized_embeddings[0]

    return resulted_positional_embeddings
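To make the aspect-ratio point concrete, here is a rough back-of-the-envelope sketch of the patch grid you end up with under a fixed max_num_patches budget. This only illustrates the idea; it is not the exact resizing/rounding the Siglip2 image processor implements:

import math

def approx_patch_grid(img_height: int, img_width: int, patch_size: int = 16, max_num_patches: int = 1024):
    """
    Rough illustration (assumption, not the library's algorithm): scale the image so that
    (h / patch_size) * (w / patch_size) is close to max_num_patches while keeping h / w constant.
    """
    scale = math.sqrt(max_num_patches * patch_size**2 / (img_height * img_width))
    grid_h = max(1, round(img_height * scale / patch_size))
    grid_w = max(1, round(img_width * scale / patch_size))
    return grid_h, grid_w

print(approx_patch_grid(1024, 256))  # roughly (64, 16): 1024 patches, 4:1 aspect ratio kept
print(approx_patch_grid(256, 1024))  # roughly (16, 64): 1024 patches, 1:4 aspect ratio kept
print(approx_patch_grid(512, 512))   # roughly (32, 32): same budget as a 512x512 square image

So with max_num_patches=1024 the total patch budget is comparable to a 512x512 image, but the grid shape (and therefore spatial_shapes, and the resized position embeddings above) still follows each image's own aspect ratio.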

Please use https://discuss.huggingface.co/ for questions; we are trying to keep issues for bug reports and feature requests only.
