Getting some strange results with hybrid_search() compared with dense + sparse results #39632
-
Hi, I'm asking here in a discussion because surely I'm doing something wrong and this is not a bug, just my fault. Using Milvus v2.5.4, I've been closely following https://milvus.io/docs/multi-vector-search.md. I have a "text" field with the analyzer enabled, and I also created the "dense" (FLOAT_VECTOR) and "sparse" (SPARSE_FLOAT_VECTOR) fields and the function as in the link above. So I created the collection with the fields, the function, and the two indexes, then inserted all the texts. I then performed a few "normal" searches (dense and sparse), which look correct. Then, as a last step, I tried a hybrid search. For some reason, all the results I'm getting come from the last document indexed, not a mix/fusion/rerank of the "normal" searches. BTW, that last indexed document only minimally matches the sparse search (it's way down in the results). Can anybody imagine why my hybrid search is returning information exclusively from the last indexed text? I really expected it to behave like the "normal" searches, with the reranker doing its work but always matching what the dense and sparse searches return. Thanks in advance, and ciao :-)
Replies: 7 comments 7 replies
-
Most likely this is a misuse of the API.
-
I've been unable (grrr) to reproduce it with a small dataset. In any case, here is the exact code that I'm using (note it comes from a Jupyter notebook).

import ipynbname
import json
from langchain_openai import OpenAIEmbeddings
from pymilvus import Function, FunctionType, AnnSearchRequest
from pymilvus import WeightedRanker, RRFRanker
from pymilvus import MilvusClient, FieldSchema, CollectionSchema, DataType
# Some silly data to test everything.
pages = [
"Security is very important",
"How to secure your computer",
"How to secure your phone",
"Protect yourself from cyber attacks",
"How to protect your data",
"Good security habits help",
"The sky is blue",
"The moon is white",
"The sun is yellow",
"The grass is green",
]
# We are going to use the bge-m3 model for dense embeddings.
embedding_model = "bge-m3" # Ollama's OpenAI embeddings.
embedding_dimension = 1024 # This is the default and cannot be changed (Ollama's OpenAI-compatible embeddings don't support the dimension parameter).
collection_name = f"{ipynbname.name()}_collection"
milvus = MilvusClient("http://localhost:19530")
if milvus.has_collection(collection_name):
milvus.drop_collection(collection_name)
fields = [
FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000, enable_analyzer=True, analyzer_params={"type": "english"}, enable_match=True,),
FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=embedding_dimension),
FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
]
schema = CollectionSchema(fields)
bm25_function = Function(
name="text_bm25_emb",
input_field_names=["text"], # Input text field
output_field_names=["sparse_vector"], # Output sparse vector field, populated by the function
function_type=FunctionType.BM25, # Use BM25 to generate the sparse embeddings
)
schema.add_function(bm25_function)
index_params = milvus.prepare_index_params()
index_params.add_index(field_name="dense_vector", index_type="HNSW", metric_type="IP", params={"M": 64, "efConstruction": 100})
index_params.add_index(field_name="sparse_vector", index_type="SPARSE_INVERTED_INDEX", metric_type="BM25", params={"inverted_index_algo": "DAAT_WAND", "drop_ratio_build": 0.2})
milvus.create_collection(collection_name, schema=schema, index_params=index_params)
embeddings = OpenAIEmbeddings(model=embedding_model, dimensions=embedding_dimension)
# Let's index the sections and add them to the Milvus collection.
for page in pages:
print(f"Indexing page: {page}")
dense_embedding = embeddings.embed_documents([page])
data = [
{
"text": page,
"dense_vector": dense_embedding[0],
}
]
milvus.insert(collection_name, data)
milvus.close()

Indexing page: Security is very important
Indexing page: How to secure your computer
Indexing page: How to secure your phone
Indexing page: Protect yourself from cyber attacks
Indexing page: How to protect your data
Indexing page: Good security habits help
Indexing page: The sky is blue
Indexing page: The moon is white
Indexing page: The sun is yellow
Indexing page: The grass is green

milvus = MilvusClient("http://localhost:19530")
query = "which color is the sky?"
dense_search_params = {
"metric_type": "IP",
"params": {
"ef": 10,
}
}
embedding = embeddings.embed_query(query)
dense_results = milvus.search(
collection_name=collection_name,
data=[embedding],
search_params=dense_search_params,
anns_field="dense_vector",
limit=5, output_fields=[
"id",
"text",
]
)
sparse_search_params = {
"metric_type": "BM25",
"drop_ratio_search": 0.2
}
sparse_results = milvus.search(
collection_name=collection_name,
data=[query],
search_params=sparse_search_params,
anns_field="sparse_vector",
limit=5, output_fields=[
"id",
"text",
]
)
# Let's try the hybrid now, using exactly the same params as the previous searches.
dense_search = AnnSearchRequest(
[embedding], "dense_vector", dense_search_params, limit=10,
)
sparse_search = AnnSearchRequest(
[query], "sparse_vector", sparse_search_params, limit=10,
)
hybrid_results = milvus.hybrid_search(
collection_name,
[dense_search, sparse_search],
RRFRanker(),
limit=5,
output_fields=[
"id",
"text",
]
)
milvus.close()
dense_retrieved = [
(
res["entity"]["id"],
res["entity"]["text"],
res["distance"]
) for res in dense_results[0]
]
print(f"Dense retrieval: {json.dumps(dense_retrieved, indent=2)}")
sparse_retrieved = [
(
res["entity"]["id"],
res["entity"]["text"],
res["distance"]
) for res in sparse_results[0]
]
print(f"Sparse retrieval: {json.dumps(sparse_retrieved, indent=2)}")
hybrid_retrieved = [
(
res["entity"]["id"],
res["entity"]["text"],
res["distance"]
) for res in hybrid_results[0]
]
print(f"Hybrid retrieval: {json.dumps(hybrid_retrieved, indent=2)}")

Dense retrieval: [
[
"455758130082550435",
"The sky is blue",
0.7351745963096619
],
[
"455758130082550439",
"The sun is yellow",
0.6092709302902222
],
[
"455758130082550437",
"The moon is white",
0.5868839621543884
],
[
"455758130082550441",
"The grass is green",
0.5120256543159485
],
[
"455758130082550423",
"Security is very important",
0.40660712122917175
]
]
Sparse retrieval: [
[
"455758130082550435",
"The sky is blue",
2.3534746170043945
]
]
Hybrid retrieval: [
[
"455758130082550435",
"The sky is blue",
0.032786883413791656
],
[
"455758130082550439",
"The sun is yellow",
0.016129031777381897
],
[
"455758130082550437",
"The moon is white",
0.01587301678955555
],
[
"455758130082550441",
"The grass is green",
0.015625
],
[
"455758130082550423",
"Security is very important",
0.015384615398943424
]
]

In my complete example, the only differences are that:
I will triple-check whether the information I'm inserting is somehow wrong (the pages are pieces of plain text of different lengths), but I cannot imagine why I'm getting those strange hybrid results when the sparse search looks acceptable. Ciao :-)
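For what it's worth, the small-dataset hybrid scores above are exactly what RRF fusion should produce, assuming Milvus's documented default smoothing constant k=60 when `RRFRanker()` is constructed with no argument: a document's fused score is the sum of 1/(k + rank) over the searches that returned it. A minimal pure-Python sketch (the ranking lists are taken from the small-dataset output above):

```python
# Reciprocal Rank Fusion (RRF): fused score = sum over searches of 1 / (k + rank).
# k = 60 is Milvus's documented default for RRFRanker().

def rrf_scores(rankings, k=60):
    """rankings: list of ranked doc-id lists (rank 1 = best). Returns {doc_id: fused score}."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

dense_ranking = ["sky", "sun", "moon", "grass", "security"]
sparse_ranking = ["sky"]  # only one sparse hit in the small-dataset run

fused = rrf_scores([dense_ranking, sparse_ranking])
# "The sky is blue" is rank 1 in both searches: 1/61 + 1/61 = 2/61 ~= 0.0328,
# and "sun" (rank 2, dense only) gets 1/62 ~= 0.0161, matching the printed scores.
for doc_id, score in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(doc_id, score)
```

So the small-dataset run is behaving correctly; the anomaly is specific to the big collection.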
-
And just for comparison, with exactly the same code as above but against my big collection, these are the results I'm getting for the "secure oauth" query. As you can see, all the hybrid results come from the same (MediaWiki) page, which is completely unrelated to the query and is not retrieved by the individual dense or sparse searches; coincidentally, that page is alphabetically the last one indexed (out of ~3000 pages).

Dense retrieval: [
[
"tool untoken oauth2",
"https://docs.moodle.org/405/en/tool_untoken_oauth2#",
0.6340181827545166
],
[
"Description",
"https://docs.moodle.org/405/en/tool_untoken_oauth2#Description",
0.5650123953819275
],
[
"Installation",
"https://docs.moodle.org/405/en/tool_untoken_oauth2#Installation",
0.5511865019798279
]
]
Sparse retrieval: [
[
"See also",
"https://docs.moodle.org/405/en/report/security/report_security_check_noauth#See_also",
4.536318778991699
],
[
"report/security/report security check webcron",
"https://docs.moodle.org/405/en/report/security/report_security_check_webcron#",
3.627807140350342
]
]
Hybrid retrieval: [
[
"Features",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Features",
0.032522473484277725
],
[
"wiziq live class module",
"https://docs.moodle.org/405/en/wiziq_live_class_module#",
0.032522473484277725
],
[
"See Also",
"https://docs.moodle.org/405/en/wiziq_live_class_module#See_Also",
0.01587301678955555
],
[
"Installation",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Installation",
0.01587301678955555
],
[
"Support",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Support",
0.015625
]
]
-
Here is some extra information, thanks for looking! If it's useful, I can share the whole .ipynb somewhere; it's not secret (apart from the OpenAI API base and key, which need to be set up in .env or similar). Note it can take 1.5h to load and 45m to index, although the problem is reproducible using just the last 50-100 docs instead of all ~3000.
Dense retrieval: [
[
"bf428e1d-f221-55de-a77f-a61755a4d727",
"https://docs.moodle.org/405/en/OAuth_2_authentication#",
0.6968940496444702
],
[
"59e06cf8-f390-5093-af2e-3685be593a25",
"https://docs.moodle.org/405/en/OAuth_2_authentication#Setting_up_OAuth_2_authentication",
0.6576463580131531
],
[
"46f64ca6-6094-51fc-bbbe-34e3333c5388",
"https://docs.moodle.org/405/en/Session_Booking#Outlook_Live_calendar",
0.6562259197235107
],
[
"391ada15-580c-5baa-b16f-eeb35d9b1122",
"https://docs.moodle.org/405/en/auth_oidc#Settings",
0.582125723361969
],
[
"996ad860-2a9a-504f-8861-aeafd0b2ae29",
"https://docs.moodle.org/405/en/auth/dev#Information",
0.5729366540908813
]
]
Sparse retrieval: [
[
"391ada15-580c-5baa-b16f-eeb35d9b1122",
"https://docs.moodle.org/405/en/OAuth_2_authentication#See_also",
10.065633773803711
],
[
"b7d55bf4-7057-5113-85c8-141871bf7635",
"https://docs.moodle.org/405/en/OAuth_2_services#How_do_I_get_a_client_ID_and_secret?",
9.582510948181152
],
[
"a4c08562-50fa-5599-939c-eb6f2a83a362",
"https://docs.moodle.org/405/en/ownCloud_Repository#Acknowledgement",
9.474544525146484
],
[
"d35aeaf3-5d1d-535a-a31a-22133ddf5f3d",
"https://docs.moodle.org/405/en/OAuth_2_services#See_also",
9.42530345916748
],
[
"22fe83ae-a20f-54fc-b436-cec85c94c5e8",
"https://docs.moodle.org/405/en/Publish_as_LTI_tool#Register_Moodle_with_the_platform",
8.989785194396973
]
]
Hybrid retrieval: [
[
"391ada15-580c-5baa-b16f-eeb35d9b1122",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Firewall_Rules",
0.03154495730996132
],
[
"59e06cf8-f390-5093-af2e-3685be593a25",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Installation",
0.03128054738044739
],
[
"bf428e1d-f221-55de-a77f-a61755a4d727",
"https://docs.moodle.org/405/en/wiziq_live_class_module#",
0.03109932318329811
],
[
"d35aeaf3-5d1d-535a-a31a-22133ddf5f3d",
"https://docs.moodle.org/405/en/umm:_Unofficial_Moodle_Mobile_app#Translating",
0.030330881476402283
],
[
"996ad860-2a9a-504f-8861-aeafd0b2ae29",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Features",
0.03030998818576336
]
]
{ 'aliases': [],
'auto_id': False,
'collection_id': 455758130084527984,
'collection_name': 'mediawiki_api_collection',
'consistency_level': 2,
'description': '',
'enable_dynamic_field': False,
'fields': [ { 'description': '',
'field_id': 100,
'is_primary': True,
'name': 'id',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 101,
'name': 'title',
'params': {'max_length': 1000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 102,
'name': 'text',
'params': { 'analyzer_params': '{"type":"english"}',
'enable_analyzer': 'true',
'enable_match': 'true',
'max_length': 5000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 103,
'name': 'source',
'params': {'max_length': 1000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 104,
'name': 'dense_vector',
'params': {'dim': 1024},
'type': <DataType.FLOAT_VECTOR: 101>},
{ 'description': '',
'field_id': 105,
'is_function_output': True,
'name': 'sparse_vector',
'params': {},
'type': <DataType.SPARSE_FLOAT_VECTOR: 104>},
{ 'description': '',
'field_id': 106,
'name': 'parent',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'element_type': <DataType.VARCHAR: 21>,
'field_id': 107,
'name': 'children',
'params': {'max_capacity': 50, 'max_length': 2000},
'type': <DataType.ARRAY: 22>},
{ 'description': '',
'element_type': <DataType.VARCHAR: 21>,
'field_id': 108,
'name': 'previous',
'params': {'max_capacity': 50, 'max_length': 2000},
'type': <DataType.ARRAY: 22>},
{ 'description': '',
'element_type': <DataType.VARCHAR: 21>,
'field_id': 109,
'name': 'next',
'params': {'max_capacity': 50, 'max_length': 2000},
'type': <DataType.ARRAY: 22>},
{ 'description': '',
'field_id': 110,
'name': 'doc_id',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 111,
'name': 'doc_title',
'params': {'max_length': 1000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 112,
'name': 'doc_hash',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>}],
'functions': [ { 'description': '',
'id': 100,
'input_field_ids': [102],
'input_field_names': ['text'],
'name': 'text_bm25_emb',
'output_field_ids': [105],
'output_field_names': ['sparse_vector'],
'params': {},
'type': <FunctionType.BM25: 1>}],
'num_partitions': 1,
'num_shards': 1,
'properties': {}}
{None}
Edited: sorry, I copied/pasted the code instead of the results.
-
Wow, what a coincidence: I just saw that the id is duplicated. How can that be possible? I'm going to triple-check my uuid()-based id generation to verify whether it's producing dupes, although I was not expecting that (nor do I get any error message when inserting them). Ciao :-)
-
Hi, I'm regenerating everything from scratch, just to closely examine what's happening with those exact chunk insertions. Basically, I generate the ids using uuid5("url + [sep] + pageid + [sep] + sectionid"), which is supposedly unique (SHA-1 based, from memory)... and I cannot understand 1) why it's not unique or 2) why the insertions don't fail. I'll share any findings here... thanks @yhmo!
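Worth noting: uuid5() is deterministic (a SHA-1 hash of a namespace plus the name string), so two chunks get the same id exactly when they are keyed with the same input string; accidental SHA-1 collisions are effectively impossible, so duplicates almost certainly mean repeated key strings (e.g. two sections resolving to the same url/pageid/sectionid combination). A minimal sketch, where the key format and separator are hypothetical reconstructions of what the thread describes:

```python
import uuid

def chunk_id(url: str, pageid: str, sectionid: str, sep: str = "|") -> str:
    # Hypothetical reconstruction of the "url + [sep] + pageid + [sep] + sectionid" key.
    key = sep.join([url, pageid, sectionid])
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

# Deterministic: identical keys always produce identical ids...
assert chunk_id("https://example.org/a", "1", "2") == chunk_id("https://example.org/a", "1", "2")
# ...and distinct keys produce distinct ids, so duplicate ids imply duplicate keys,
# not a hash collision.
assert chunk_id("https://example.org/a", "1", "2") != chunk_id("https://example.org/a", "1", "3")
```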
-
Wow, confirmed: somehow my uuid5() generation is leading to duplicates; I'll examine that later. The curious things are 1) that inserts of those dupe ids happen without any problem and 2) that different searches against the same collection return them differently. In any case, I think this can be closed, since I should not be feeding the collection with dupe IDs. Thanks!
These ids are also duplicated:
"59e06cf8-f390-5093-af2e-3685be593a25"
"391ada15-580c-5baa-b16f-eeb35d9b1122"
"d35aeaf3-5d1d-535a-a31a-22133ddf5f3d"
"996ad860-2a9a-504f-8861-aeafd0b2ae29"
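As a quick pre-insert sanity check, counting the generated ids before feeding them to Milvus would surface this immediately (a sketch; `ids` here is a stand-in for whatever id list the notebook actually builds):

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return each id that appears more than once, with its count."""
    return {doc_id: n for doc_id, n in Counter(ids).items() if n > 1}

ids = ["a", "b", "a", "c", "b", "b"]  # stand-in for the generated uuid5 ids
print(find_duplicate_ids(ids))  # → {'a': 2, 'b': 3}
```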