
AzureAISearch Retriever only returns up to 50 docs #27830

Closed
5 tasks done
sjjpo2002 opened this issue Nov 1, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@sjjpo2002

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

To reproduce the issue: create an Azure AI Search index and upload more than 50 documents that share a common value in a searchable field, for example the same file name in the source metadata field of every chunk. Then instantiate the retriever:

retriever = AzureAISearchRetriever(
    service_name=AZURE_SEARCH_ENDPOINT,
    index_name=AZURE_SEARCH_INDEX_NAME,
    api_key=AZURE_SEARCH_KEY,
    content_key="content",
    top_k=None,
)

and invoke a query like:

retriever.invoke(doc.metadata["source"])

Setting top_k to None should return all results, according to the docstring:

top_k: Optional[int] = None
"""Number of results to retrieve. Set to None to retrieve all results."""

But because Azure applies a default limit of 50, the current implementation always returns at most 50 results.
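Until the retriever handles this itself, a caller can work around the limit by paginating with the `$top`/`$skip` query parameters described below. The sketch keeps the paging loop pure so it can be driven by any page-fetching function; the `fetch_page` callable and its signature are assumptions for illustration, not part of LangChain's API.

```python
from typing import Callable

def paginate_all(
    fetch_page: Callable[[int, int], list[dict]],
    page_size: int = 50,
) -> list[dict]:
    """Collect every result by repeatedly fetching $top/$skip pages.

    fetch_page(top, skip) should issue one search request and return
    that page's documents; the loop stops on the first short page.
    """
    results: list[dict] = []
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        results.extend(page)
        if len(page) < page_size:
            return results
        skip += page_size
```

In practice `fetch_page` would wrap a GET against the index's `docs` endpoint, passing `top` as `$top` and `skip` as `$skip`.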

Error Message and Stack Trace (if applicable)

No response

Description

The Azure AI Search service does not return all matches for a query against a search field, as documented on Microsoft's website:

"By default, the search engine returns up to the first 50 matches. The top 50 are determined by search score, assuming the query is full text search or semantic."

The same documentation makes clear that pagination is required to retrieve all matching documents:

"To control the paging of all documents returned in a result set, add $top and $skip parameters to the GET query request, or top and skip to the POST query request. The following list explains the logic.

Return the first set of 15 matching documents plus a count of total matches: GET /indexes//docs?search=&$top=15&$skip=0&$count=true

Return the second set, skipping the first 15 to get the next 15: $top=15&$skip=15. Repeat for the third set of 15: $top=15&$skip=30"
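The quoted examples translate into a simple URL-building rule: page n starts at offset n * page_size. A minimal sketch (the function name and base-URL argument are hypothetical):

```python
def page_url(base: str, query: str, page: int, page_size: int = 15) -> str:
    """Build the GET URL for one result page, mirroring the quoted
    Azure docs: $top is the page size, $skip is the page offset,
    and $count=true asks for the total match count."""
    return (
        f"{base}/docs?search={query}"
        f"&$top={page_size}&$skip={page * page_size}&$count=true"
    )
```

Pages 0, 1, 2 then yield exactly the `$top=15&$skip=0`, `$top=15&$skip=15`, `$top=15&$skip=30` sequence from the documentation.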

If we look at the existing code, no pagination is implemented, so the retriever returns at most 50 results no matter how many records match. This behavior is not fully documented and can produce unexpected results when a user intends to retrieve all documents. This is clear from the function that builds the API query:

def _build_search_url(self, query: str) -> str:
    url_suffix = get_from_env("", "AZURE_AI_SEARCH_URL_SUFFIX", DEFAULT_URL_SUFFIX)
    if url_suffix in self.service_name and "https://" in self.service_name:
        base_url = f"{self.service_name}/"
    elif url_suffix in self.service_name and "https://" not in self.service_name:
        base_url = f"https://{self.service_name}/"
    elif url_suffix not in self.service_name and "https://" in self.service_name:
        base_url = f"{self.service_name}.{url_suffix}/"
    elif url_suffix not in self.service_name and "https://" not in self.service_name:
        base_url = f"https://{self.service_name}.{url_suffix}/"
    else:
        # pass to Azure to throw a specific error
        base_url = self.service_name
    endpoint_path = f"indexes/{self.index_name}/docs?api-version={self.api_version}"
    top_param = f"&$top={self.top_k}" if self.top_k else ""
    filter_param = f"&$filter={self.filter}" if self.filter else ""
    return base_url + endpoint_path + f"&search={query}" + top_param + filter_param


System Info

System Information

OS: Linux
OS Version: #1 SMP Wed Sep 11 18:02:00 EDT 2024
Python Version: 3.11.9 (main, Aug 26 2024, 10:40:41) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]

Package Information

langchain_core: 0.2.33
langchain: 0.2.5
langchain_community: 0.2.5
langsmith: 0.1.101
langchain_cli: 0.0.29
langchain_openai: 0.1.22
langchain_text_splitters: 0.2.2
langserve: 0.2.2

Optional packages not installed

langgraph

Other Dependencies

aiohttp: 3.9.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
fastapi: 0.110.0
gitpython: 3.1.43
httpx: 0.27.0
jsonpatch: 1.33
langserve[all]: Installed. No version info available.
libcst: 1.4.0
numpy: 1.26.4
openai: 1.41.0
orjson: 3.10.5
packaging: 23.2
pydantic: 2.6.2
pyproject-toml: 0.0.10
PyYAML: 5.3.1
requests: 2.32.3
SQLAlchemy: 2.0.27
sse-starlette: 1.8.2
tenacity: 8.4.1
tiktoken: 0.7.0
tomlkit: 0.12.5
typer[all]: Installed. No version info available.
typing-extensions: 4.12.2
uvicorn: 0.23.2

@sjjpo2002 sjjpo2002 changed the title AzureAISearch Retriever only returns up to 5 docs AzureAISearch Retriever only returns up to 50 docs Nov 1, 2024
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Nov 1, 2024

dosubot bot commented Jan 31, 2025

Hi, @sjjpo2002. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The Azure AI Search Retriever is limited to returning a maximum of 50 documents.
  • Setting top_k to None does not return all results as expected.
  • The limitation is due to Azure's default settings.
  • You suggested implementing pagination to retrieve all documents.
  • No further comments or developments have been made on this issue.

Next Steps:

  • Please confirm if this issue is still relevant with the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
  • If there is no response, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

@dosubot bot added the stale label (issue has not had recent activity or appears to be solved) Jan 31, 2025
@dosubot bot closed this as not planned Feb 7, 2025
@dosubot bot removed the stale label Feb 7, 2025