Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for dimensions field like in OpenAI text-embedding-3, thanks #476

Closed
ericg108 opened this issue Nov 21, 2024 · 8 comments
Closed
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@ericg108
Copy link

Feature request

when initializing with SentenceTransformers, we can use the truncate_dim argument, like below:
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=dimensions)

and in calling OpenAI text-embedding-3, we can also pass a `` argument to get variant-length embeddings

dimensions integer Optional The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models.
see also: https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-dimensions

Motivation

more and more embedding models are supporting Matryoshka embeddings, namely allowing users to get dimensions of varying length, like mxbai-embed-large-v1, jina-embeddings-v3 etc.
this is very useful in scenarios with limited resources. hope it could be supported. Thanks.

Your contribution

I guess it's not a big modification. I may be able to add this feature when I'm told where to modify. Thanks.

@michaelfeil
Copy link
Owner

michaelfeil commented Nov 21, 2024

Hey, you are welcome to work on this. I have seen you never contributed, so here is a

Add a matryoshka_dim param:

This needs modifications in batch handler for embed,audio,video.

There is also engine + syncengine, which needs three times the integration for above.
https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/engine.py
-> And a test for each of them in test_engine.py

-> Add as test in tests/end-to-end/test_dummyengine.py (will only be based on numpy model)
-> add test with openai client compatability. tests/end-to-end/test_ openai_compat.py

You also need to make the parameters / signature available to matryoska dim.
https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/sync_engine.py

Support in the pydantic model:

Should be a small code change (~50 LOC in 10 files), but its the first time adding a request time parameter and therefore needs extensive testing.
I think its easy, but I would enjoy if more people contribute! Let me know if this is helpful to get started.

@michaelfeil michaelfeil added help wanted Extra attention is needed good first issue Good for newcomers labels Nov 21, 2024
@michaelfeil michaelfeil reopened this Nov 21, 2024
@ericg108
Copy link
Author

okay. Let me work on this.
But I'm not familiar with git operations which may take me some time to file a merge request.

@michaelfeil
Copy link
Owner

It will make a lot of sense to first know the basics of git & the feature branch pattern, as well as being familiar with inheritance & unit testing in Python.

@wirthual
Copy link
Collaborator

@ericg108 with the recent merge to master, you should be able now to truncate the returned embeddings by passing in a dimension params just like the openAI api.

Let me know if that works for you.

@ericg108
Copy link
Author

ericg108 commented Jan 8, 2025

@wirthual thank you for your effort. It took me a lot time to learn the codebase and still I didn't find the place to modify the SentenceTranformer initialization.
From your code, I see just slicing the dimension to get the matryoshka embeddings which I didn't expect. Does that mean the underlying implementation is not based on the sentencetransformer so we don't need to modify the SentenceTranformer initialization by passing the argument truncate_dim?
Thanks!

@michaelfeil
Copy link
Owner

michaelfeil commented Jan 8, 2025

@ericg108 yes, that's the feature you requested.
It's identical to what "big" inference providers, text-embedding-3 at OpenAI and the original matryoshka paper do. The reason why it works, is that the model was trained to sort the output by importance.

@ericg108
Copy link
Author

ericg108 commented Jan 8, 2025

@michaelfeil got it, thanks!

@wirthual
Copy link
Collaborator

wirthual commented Jan 9, 2025

Yes, its the same as passing truncate_dim to sentence_transformers . You can see the truncate_embeddings function which is called here is simply doing:

return embeddings[..., :truncate_dim]

However if you managed to pass truncate_dims to Sentence Transformers, this might be of interest for issue #482

Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants