Getting some strange results with hybrid_search() compared with dense + sparse results #39632
-
Hi, I'm asking here in a discussion because surely I'm doing something wrong and this is not a bug, just my fault. Using Milvus v2.5.4, I've been closely following https://milvus.io/docs/multi-vector-search.md. I have a "text" field with the analyzer enabled, and I also created the "dense" (FLOAT_VECTOR) and "sparse" (SPARSE_FLOAT_VECTOR) fields and the function as in the link above. So I created the collection with the fields, the function, and the two indexes, then inserted all the texts. I then performed a few "normal" searches (dense and sparse), which look correct. Then, as a last step, I tried a hybrid search. For some reason, all the results I'm getting come from the last document indexed, not a mix/fusion/rerank of the "normal" searches. BTW, that last indexed document only minimally matches the sparse search (it's way down in the results). Can anybody imagine why my hybrid search is returning information exclusively from the last indexed text? I really expected it to behave like the "normal" searches, with the reranker doing its work but always matching what the dense and sparse searches return. Thanks in advance, and ciao :-)
Replies: 7 comments 7 replies
-
Most likely this is a misuse of the API.
-
I've been unable (grrr) to reproduce it with a small dataset. In any case, here is the exact code that I'm using (note it comes from a Jupyter notebook).

import ipynbname
import json
from langchain_openai import OpenAIEmbeddings
from pymilvus import Function, FunctionType, AnnSearchRequest
from pymilvus import WeightedRanker, RRFRanker
from pymilvus import MilvusClient, FieldSchema, CollectionSchema, DataType
# Some silly data to test everything.
pages = [
"Security is very important",
"How to secure your computer",
"How to secure your phone",
"Protect yourself from cyber attacks",
"How to protect your data",
"Good security habits help",
"The sky is blue",
"The moon is white",
"The sun is yellow",
"The grass is green",
]
# We are going to use the bge-m3 model for dense embeddings.
embedding_model = "bge-m3" # Ollama's OpenAI embeddings.
embedding_dimension = 1024 # This is the default and cannot be changed (Ollama's OpenAI-compatible embeddings don't support the dimension parameter).
collection_name = f"{ipynbname.name()}_collection"
milvus = MilvusClient("http://localhost:19530")
if milvus.has_collection(collection_name):
milvus.drop_collection(collection_name)
fields = [
FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000, enable_analyzer=True, analyzer_params={"type": "english"}, enable_match=True,),
FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=embedding_dimension),
FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
]
schema = CollectionSchema(fields)
bm25_function = Function(
name="text_bm25_emb",
input_field_names=["text"], # Input text field
output_field_names=["sparse_vector"], # Output sparse vector field, populated by the function
function_type=FunctionType.BM25, # Use BM25 to generate the sparse embeddings
)
schema.add_function(bm25_function)
index_params = milvus.prepare_index_params()
index_params.add_index(field_name="dense_vector", index_type="HNSW", metric_type="IP", params={"M": 64, "efConstruction": 100})
index_params.add_index(field_name="sparse_vector", index_type="SPARSE_INVERTED_INDEX", metric_type="BM25", params={"inverted_index_algo": "DAAT_WAND", "drop_ratio_build": 0.2})
milvus.create_collection(collection_name, schema=schema, index_params=index_params)
embeddings = OpenAIEmbeddings(model=embedding_model, dimensions=embedding_dimension)
# Let's index the sections and add them to the Milvus collection.
for page in pages:
print(f"Indexing page: {page}")
dense_embedding = embeddings.embed_documents([page])
data = [
{
"text": page,
"dense_vector": dense_embedding[0],
}
]
milvus.insert(collection_name, data)
milvus.close()

Indexing page: Security is very important
Indexing page: How to secure your computer
Indexing page: How to secure your phone
Indexing page: Protect yourself from cyber attacks
Indexing page: How to protect your data
Indexing page: Good security habits help
Indexing page: The sky is blue
Indexing page: The moon is white
Indexing page: The sun is yellow
Indexing page: The grass is green

milvus = MilvusClient("http://localhost:19530")
query = "which color is the sky?"
dense_search_params = {
"metric_type": "IP",
"params": {
"ef": 10,
}
}
embedding = embeddings.embed_query(query)
dense_results = milvus.search(
collection_name=collection_name,
data=[embedding],
search_params=dense_search_params,
anns_field="dense_vector",
limit=5, output_fields=[
"id",
"text",
]
)
sparse_search_params = {
"metric_type": "BM25",
"drop_ratio_search": 0.2
}
sparse_results = milvus.search(
collection_name=collection_name,
data=[query],
search_params=sparse_search_params,
anns_field="sparse_vector",
limit=5, output_fields=[
"id",
"text",
]
)
# Let's try the hybrid now, using exactly the same params as the previous searches.
dense_search = AnnSearchRequest(
[embedding], "dense_vector", dense_search_params, limit=10,
)
sparse_search = AnnSearchRequest(
[query], "sparse_vector", sparse_search_params, limit=10,
)
hybrid_results = milvus.hybrid_search(
collection_name,
[dense_search, sparse_search],
RRFRanker(),
limit=5,
output_fields=[
"id",
"text",
]
)
milvus.close()
dense_retrieved = [
(
res["entity"]["id"],
res["entity"]["text"],
res["distance"]
) for res in dense_results[0]
]
print(f"Dense retrieval: {json.dumps(dense_retrieved, indent=2)}")
sparse_retrieved = [
(
res["entity"]["id"],
res["entity"]["text"],
res["distance"]
) for res in sparse_results[0]
]
print(f"Sparse retrieval: {json.dumps(sparse_retrieved, indent=2)}")
hybrid_retrieved = [
(
res["entity"]["id"],
res["entity"]["text"],
res["distance"]
) for res in hybrid_results[0]
]
print(f"Hybrid retrieval: {json.dumps(hybrid_retrieved, indent=2)}")

Dense retrieval: [
[
"455758130082550435",
"The sky is blue",
0.7351745963096619
],
[
"455758130082550439",
"The sun is yellow",
0.6092709302902222
],
[
"455758130082550437",
"The moon is white",
0.5868839621543884
],
[
"455758130082550441",
"The grass is green",
0.5120256543159485
],
[
"455758130082550423",
"Security is very important",
0.40660712122917175
]
]
Sparse retrieval: [
[
"455758130082550435",
"The sky is blue",
2.3534746170043945
]
]
Hybrid retrieval: [
[
"455758130082550435",
"The sky is blue",
0.032786883413791656
],
[
"455758130082550439",
"The sun is yellow",
0.016129031777381897
],
[
"455758130082550437",
"The moon is white",
0.01587301678955555
],
[
"455758130082550441",
"The grass is green",
0.015625
],
[
"455758130082550423",
"Security is very important",
0.015384615398943424
]
]

In my complete example, the only differences are that:
I will triple-check whether the information I'm inserting is somehow wrong (the pages are pieces of plain text of different lengths), but I cannot imagine why I'm getting those strange hybrid results when the sparse search looks acceptable. Ciao :-)
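For what it's worth, the small-dataset hybrid scores above are exactly what RRF fusion should produce, assuming Milvus's documented default smoothing constant k=60 when `RRFRanker()` is constructed with no argument: a document's fused score is the sum of 1/(k + rank) over the searches that returned it. A minimal pure-Python sketch (the ranking lists are taken from the small-dataset output above):

```python
# Reciprocal Rank Fusion (RRF): fused score = sum over searches of 1 / (k + rank).
# k = 60 is Milvus's documented default for RRFRanker().

def rrf_scores(rankings, k=60):
    """rankings: list of ranked doc-id lists (rank 1 = best). Returns {doc_id: fused score}."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

dense_ranking = ["sky", "sun", "moon", "grass", "security"]
sparse_ranking = ["sky"]  # only one sparse hit in the small-dataset run

fused = rrf_scores([dense_ranking, sparse_ranking])
# "The sky is blue" is rank 1 in both searches: 1/61 + 1/61 = 2/61 ~= 0.0328,
# and "sun" (rank 2, dense only) gets 1/62 ~= 0.0161, matching the printed scores.
for doc_id, score in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(doc_id, score)
```

So the small-dataset run is behaving correctly; the anomaly is specific to the big collection.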
-
And just for comparison, with exactly the same code as above but against my big collection, these are the results I'm getting for the "secure oauth" query. As you can see, all the hybrid results come from the same (MediaWiki) page, which is completely unrelated to the query and is not retrieved by the individual dense or sparse searches; coincidentally, that page is alphabetically the last one indexed (out of ~3000 pages).

Dense retrieval: [
[
"tool untoken oauth2",
"https://docs.moodle.org/405/en/tool_untoken_oauth2#",
0.6340181827545166
],
[
"Description",
"https://docs.moodle.org/405/en/tool_untoken_oauth2#Description",
0.5650123953819275
],
[
"Installation",
"https://docs.moodle.org/405/en/tool_untoken_oauth2#Installation",
0.5511865019798279
]
]
Sparse retrieval: [
[
"See also",
"https://docs.moodle.org/405/en/report/security/report_security_check_noauth#See_also",
4.536318778991699
],
[
"report/security/report security check webcron",
"https://docs.moodle.org/405/en/report/security/report_security_check_webcron#",
3.627807140350342
]
]
Hybrid retrieval: [
[
"Features",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Features",
0.032522473484277725
],
[
"wiziq live class module",
"https://docs.moodle.org/405/en/wiziq_live_class_module#",
0.032522473484277725
],
[
"See Also",
"https://docs.moodle.org/405/en/wiziq_live_class_module#See_Also",
0.01587301678955555
],
[
"Installation",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Installation",
0.01587301678955555
],
[
"Support",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Support",
0.015625
]
]
-
Here is some extra information, thanks for looking! If it's useful, I can share the whole .ipynb somewhere; it's not secret (apart from the OpenAI API base and key, which need to be set up in .env or similar). Note it can take 1.5h to load and 45m to index, although the problem is reproducible using just the last 50-100 docs instead of all ~3000.
Dense retrieval: [
[
"bf428e1d-f221-55de-a77f-a61755a4d727",
"https://docs.moodle.org/405/en/OAuth_2_authentication#",
0.6968940496444702
],
[
"59e06cf8-f390-5093-af2e-3685be593a25",
"https://docs.moodle.org/405/en/OAuth_2_authentication#Setting_up_OAuth_2_authentication",
0.6576463580131531
],
[
"46f64ca6-6094-51fc-bbbe-34e3333c5388",
"https://docs.moodle.org/405/en/Session_Booking#Outlook_Live_calendar",
0.6562259197235107
],
[
"391ada15-580c-5baa-b16f-eeb35d9b1122",
"https://docs.moodle.org/405/en/auth_oidc#Settings",
0.582125723361969
],
[
"996ad860-2a9a-504f-8861-aeafd0b2ae29",
"https://docs.moodle.org/405/en/auth/dev#Information",
0.5729366540908813
]
]
Sparse retrieval: [
[
"391ada15-580c-5baa-b16f-eeb35d9b1122",
"https://docs.moodle.org/405/en/OAuth_2_authentication#See_also",
10.065633773803711
],
[
"b7d55bf4-7057-5113-85c8-141871bf7635",
"https://docs.moodle.org/405/en/OAuth_2_services#How_do_I_get_a_client_ID_and_secret?",
9.582510948181152
],
[
"a4c08562-50fa-5599-939c-eb6f2a83a362",
"https://docs.moodle.org/405/en/ownCloud_Repository#Acknowledgement",
9.474544525146484
],
[
"d35aeaf3-5d1d-535a-a31a-22133ddf5f3d",
"https://docs.moodle.org/405/en/OAuth_2_services#See_also",
9.42530345916748
],
[
"22fe83ae-a20f-54fc-b436-cec85c94c5e8",
"https://docs.moodle.org/405/en/Publish_as_LTI_tool#Register_Moodle_with_the_platform",
8.989785194396973
]
]
Hybrid retrieval: [
[
"391ada15-580c-5baa-b16f-eeb35d9b1122",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Firewall_Rules",
0.03154495730996132
],
[
"59e06cf8-f390-5093-af2e-3685be593a25",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Installation",
0.03128054738044739
],
[
"bf428e1d-f221-55de-a77f-a61755a4d727",
"https://docs.moodle.org/405/en/wiziq_live_class_module#",
0.03109932318329811
],
[
"d35aeaf3-5d1d-535a-a31a-22133ddf5f3d",
"https://docs.moodle.org/405/en/umm:_Unofficial_Moodle_Mobile_app#Translating",
0.030330881476402283
],
[
"996ad860-2a9a-504f-8861-aeafd0b2ae29",
"https://docs.moodle.org/405/en/wiziq_live_class_module#Features",
0.03030998818576336
]
]
{ 'aliases': [],
'auto_id': False,
'collection_id': 455758130084527984,
'collection_name': 'mediawiki_api_collection',
'consistency_level': 2,
'description': '',
'enable_dynamic_field': False,
'fields': [ { 'description': '',
'field_id': 100,
'is_primary': True,
'name': 'id',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 101,
'name': 'title',
'params': {'max_length': 1000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 102,
'name': 'text',
'params': { 'analyzer_params': '{"type":"english"}',
'enable_analyzer': 'true',
'enable_match': 'true',
'max_length': 5000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 103,
'name': 'source',
'params': {'max_length': 1000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 104,
'name': 'dense_vector',
'params': {'dim': 1024},
'type': <DataType.FLOAT_VECTOR: 101>},
{ 'description': '',
'field_id': 105,
'is_function_output': True,
'name': 'sparse_vector',
'params': {},
'type': <DataType.SPARSE_FLOAT_VECTOR: 104>},
{ 'description': '',
'field_id': 106,
'name': 'parent',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'element_type': <DataType.VARCHAR: 21>,
'field_id': 107,
'name': 'children',
'params': {'max_capacity': 50, 'max_length': 2000},
'type': <DataType.ARRAY: 22>},
{ 'description': '',
'element_type': <DataType.VARCHAR: 21>,
'field_id': 108,
'name': 'previous',
'params': {'max_capacity': 50, 'max_length': 2000},
'type': <DataType.ARRAY: 22>},
{ 'description': '',
'element_type': <DataType.VARCHAR: 21>,
'field_id': 109,
'name': 'next',
'params': {'max_capacity': 50, 'max_length': 2000},
'type': <DataType.ARRAY: 22>},
{ 'description': '',
'field_id': 110,
'name': 'doc_id',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 111,
'name': 'doc_title',
'params': {'max_length': 1000},
'type': <DataType.VARCHAR: 21>},
{ 'description': '',
'field_id': 112,
'name': 'doc_hash',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>}],
'functions': [ { 'description': '',
'id': 100,
'input_field_ids': [102],
'input_field_names': ['text'],
'name': 'text_bm25_emb',
'output_field_ids': [105],
'output_field_names': ['sparse_vector'],
'params': {},
'type': <FunctionType.BM25: 1>}],
'num_partitions': 1,
'num_shards': 1,
'properties': {}}
{None}
Edited: sorry, I copied/pasted the code instead of the results.
-
Wow, what a coincidence: I just saw that the id is duplicated. How can that be possible? I'm going to triple-check my uuid()-based id generation to verify whether it's producing dupes, although I was not expecting that (nor do I get any error message when inserting them). Ciao :-)
-
Hi, I'm regenerating everything from scratch, just to closely examine what's happening with those exact chunk insertions. Basically, I generate the ids using uuid5("url + [sep] + pageid + [sep] + sectionid"), which is supposedly unique (SHA-1 based, from memory)... and I cannot understand 1) why it's not unique or 2) why the insertions don't fail. I'll share any findings here... thanks @yhmo!
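Worth noting: uuid5() is deterministic (a SHA-1 hash of a namespace plus the name string), so two chunks get the same id exactly when they are keyed with the same input string; accidental SHA-1 collisions are effectively impossible, so duplicates almost certainly mean repeated key strings (e.g. two sections resolving to the same url/pageid/sectionid combination). A minimal sketch, where the key format and separator are hypothetical reconstructions of what the thread describes:

```python
import uuid

def chunk_id(url: str, pageid: str, sectionid: str, sep: str = "|") -> str:
    # Hypothetical reconstruction of the "url + [sep] + pageid + [sep] + sectionid" key.
    key = sep.join([url, pageid, sectionid])
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

# Deterministic: identical keys always produce identical ids...
assert chunk_id("https://example.org/a", "1", "2") == chunk_id("https://example.org/a", "1", "2")
# ...and distinct keys produce distinct ids, so duplicate ids imply duplicate keys,
# not a hash collision.
assert chunk_id("https://example.org/a", "1", "2") != chunk_id("https://example.org/a", "1", "3")
```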
-
Wow, confirmed: somehow my uuid5() generation is leading to duplicates; I'll examine that later. The curious things are 1) that inserts of those dupe ids happen without any problem and 2) that different searches against the same collection return them differently. In any case, I think this can be closed, since I should not be feeding the collection with dupe IDs. Thanks!
These ids are also duplicated:
"59e06cf8-f390-5093-af2e-3685be593a25"
"391ada15-580c-5baa-b16f-eeb35d9b1122"
"d35aeaf3-5d1d-535a-a31a-22133ddf5f3d"
"996ad860-2a9a-504f-8861-aeafd0b2ae29"
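As a quick pre-insert sanity check, counting the generated ids before feeding them to Milvus would surface this immediately (a sketch; `ids` here is a stand-in for whatever id list the notebook actually builds):

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return each id that appears more than once, with its count."""
    return {doc_id: n for doc_id, n in Counter(ids).items() if n > 1}

ids = ["a", "b", "a", "c", "b", "b"]  # stand-in for the generated uuid5 ids
print(find_duplicate_ids(ids))  # → {'a': 2, 'b': 3}
```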