
AzureAISearch Retriever only returns up to 50 docs #27830

Closed
5 tasks done
sjjpo2002 opened this issue Nov 1, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@sjjpo2002

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

To reproduce the issue: create an Azure AI Search index and upload more than 50 documents that share a common value in a searchable field, for example the same file name in the source metadata field of every chunk. Then instantiate the retriever:

retriever = AzureAISearchRetriever(
    service_name=AZURE_SEARCH_ENDPOINT,
    index_name=AZURE_SEARCH_INDEX_NAME,
    api_key=AZURE_SEARCH_KEY,
    content_key="content",
    top_k=None,
)

and invoke a query like:

retriever.invoke(doc.metadata["source"])

Setting top_k to None should return all results, according to the docstring:

top_k: Optional[int] = None
"""Number of results to retrieve. Set to None to retrieve all results."""

But because Azure applies a default limit of 50, the current implementation always returns at most 50 results.
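Until the retriever handles this itself, a caller can work around the limit by paginating with the `$top`/`$skip` query parameters described below. The sketch keeps the paging loop pure so it can be driven by any page-fetching function; the `fetch_page` callable and its signature are assumptions for illustration, not part of LangChain's API.

```python
from typing import Callable

def paginate_all(
    fetch_page: Callable[[int, int], list[dict]],
    page_size: int = 50,
) -> list[dict]:
    """Collect every result by repeatedly fetching $top/$skip pages.

    fetch_page(top, skip) should issue one search request and return
    that page's documents; the loop stops on the first short page.
    """
    results: list[dict] = []
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        results.extend(page)
        if len(page) < page_size:
            return results
        skip += page_size
```

In practice `fetch_page` would wrap a GET against the index's `docs` endpoint, passing `top` as `$top` and `skip` as `$skip`.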

Error Message and Stack Trace (if applicable)

No response

Description

The Azure AI Search service does not return all matches for a query against a search field, as documented on Microsoft's website:

"By default, the search engine returns up to the first 50 matches. The top 50 are determined by search score, assuming the query is full text search or semantic."

The same documentation makes clear that pagination is required to retrieve all matching documents:

"To control the paging of all documents returned in a result set, add $top and $skip parameters to the GET query request, or top and skip to the POST query request. The following list explains the logic.

Return the first set of 15 matching documents plus a count of total matches: GET /indexes//docs?search=&$top=15&$skip=0&$count=true

Return the second set, skipping the first 15 to get the next 15: $top=15&$skip=15. Repeat for the third set of 15: $top=15&$skip=30"
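The quoted examples translate into a simple URL-building rule: page n starts at offset n * page_size. A minimal sketch (the function name and base-URL argument are hypothetical):

```python
def page_url(base: str, query: str, page: int, page_size: int = 15) -> str:
    """Build the GET URL for one result page, mirroring the quoted
    Azure docs: $top is the page size, $skip is the page offset,
    and $count=true asks for the total match count."""
    return (
        f"{base}/docs?search={query}"
        f"&$top={page_size}&$skip={page * page_size}&$count=true"
    )
```

Pages 0, 1, 2 then yield exactly the `$top=15&$skip=0`, `$top=15&$skip=15`, `$top=15&$skip=30` sequence from the documentation.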

If we look at the existing code, no pagination is implemented, so the retriever returns at most 50 results no matter how many records match. This behavior is not fully documented and can produce unexpected results when a user intends to retrieve all documents. This is clear from the function that builds the API query:

def _build_search_url(self, query: str) -> str:
    url_suffix = get_from_env("", "AZURE_AI_SEARCH_URL_SUFFIX", DEFAULT_URL_SUFFIX)
    if url_suffix in self.service_name and "https://" in self.service_name:
        base_url = f"{self.service_name}/"
    elif url_suffix in self.service_name and "https://" not in self.service_name:
        base_url = f"https://{self.service_name}/"
    elif url_suffix not in self.service_name and "https://" in self.service_name:
        base_url = f"{self.service_name}.{url_suffix}/"
    elif url_suffix not in self.service_name and "https://" not in self.service_name:
        base_url = f"https://{self.service_name}.{url_suffix}/"
    else:
        # pass to Azure to throw a specific error
        base_url = self.service_name
    endpoint_path = f"indexes/{self.index_name}/docs?api-version={self.api_version}"
    top_param = f"&$top={self.top_k}" if self.top_k else ""
    filter_param = f"&$filter={self.filter}" if self.filter else ""
    return base_url + endpoint_path + f"&search={query}" + top_param + filter_param


System Info

System Information

OS: Linux
OS Version: #1 SMP Wed Sep 11 18:02:00 EDT 2024
Python Version: 3.11.9 (main, Aug 26 2024, 10:40:41) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]

Package Information

langchain_core: 0.2.33
langchain: 0.2.5
langchain_community: 0.2.5
langsmith: 0.1.101
langchain_cli: 0.0.29
langchain_openai: 0.1.22
langchain_text_splitters: 0.2.2
langserve: 0.2.2

Optional packages not installed

langgraph

Other Dependencies

aiohttp: 3.9.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
fastapi: 0.110.0
gitpython: 3.1.43
httpx: 0.27.0
jsonpatch: 1.33
langserve[all]: Installed. No version info available.
libcst: 1.4.0
numpy: 1.26.4
openai: 1.41.0
orjson: 3.10.5
packaging: 23.2
pydantic: 2.6.2
pyproject-toml: 0.0.10
PyYAML: 5.3.1
requests: 2.32.3
SQLAlchemy: 2.0.27
sse-starlette: 1.8.2
tenacity: 8.4.1
tiktoken: 0.7.0
tomlkit: 0.12.5
typer[all]: Installed. No version info available.
typing-extensions: 4.12.2
uvicorn: 0.23.2

@sjjpo2002 sjjpo2002 changed the title AzureAISearch Retriever only returns up to 5 docs AzureAISearch Retriever only returns up to 50 docs Nov 1, 2024
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Nov 1, 2024

dosubot bot commented Jan 31, 2025

Hi, @sjjpo2002. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The Azure AI Search Retriever is limited to returning a maximum of 50 documents.
  • Setting top_k to None does not return all results as expected.
  • The limitation is due to Azure's default settings.
  • You suggested implementing pagination to retrieve all documents.
  • No further comments or developments have been made on this issue.

Next Steps:

  • Please confirm if this issue is still relevant with the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
  • If there is no response, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

@dosubot bot added the stale label (issue has not had recent activity or appears to be solved) Jan 31, 2025
@dosubot bot closed this as not planned Feb 7, 2025
@dosubot bot removed the stale label Feb 7, 2025