CacheBackedEmbeddings don't hash keys, resulting in error with LocalFileStore #29496

Open
5 tasks done
lyger opened this issue Jan 30, 2025 · 1 comment
Labels
  • 🤖:bug: Related to a bug, vulnerability, unexpected error with an existing feature
  • investigate: Flagged for investigation.


lyger commented Jan 30, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import langchain.embeddings
import langchain.storage
import langchain_core.documents
import langchain_core.vectorstores
import langchain_openai

documents = [
    langchain_core.documents.Document("This text is invalid for a key"),
    langchain_core.documents.Document("This text is also invalid for a key"),
    langchain_core.documents.Document("このテキストはキーになりません"),
]

cache = langchain.storage.LocalFileStore("./cache/")

embeddings = langchain.embeddings.CacheBackedEmbeddings(
    underlying_embeddings=langchain_openai.OpenAIEmbeddings(),
    document_embedding_store=cache,
)

vector_store = langchain_core.vectorstores.InMemoryVectorStore(embeddings)
vector_store.add_documents(documents)

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/***/cache_backed_bug_minimal.py", line 21, in <module>
    vector_store.add_documents(documents)
  File "/***/lib/python3.12/site-packages/langchain_core/vectorstores/in_memory.py", line 175, in add_documents
    vectors = self.embedding.embed_documents(texts)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/***/lib/python3.12/site-packages/langchain/embeddings/cache.py", line 124, in embed_documents
    vectors: List[Union[List[float], None]] = self.document_embedding_store.mget(
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/***/lib/python3.12/site-packages/langchain/storage/file_system.py", line 118, in mget
    full_path = self._get_full_path(key)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/***/lib/python3.12/site-packages/langchain/storage/file_system.py", line 77, in _get_full_path
    raise InvalidKeyException(f"Invalid characters in key: {key}")
langchain_core.stores.InvalidKeyException: Invalid characters in key: This text is invalid for a key

Description

I'm trying to cache embeddings locally, and I expected the above code to work with arbitrary document text, since the CacheBackedEmbeddings documentation says, "The text is hashed and the hash is used as the key in the cache."

However, the text does not appear to be hashed at all: the cache uses the raw text as the key, and LocalFileStore raises InvalidKeyException for any text containing disallowed characters, including spaces and non-ASCII characters.
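
To illustrate the failure mode without any LangChain dependency, here is a minimal standard-library sketch. The allow-list regex, the `is_safe_key` and `hashed_key` helpers, and the choice of SHA-256 are all assumptions made for illustration; they approximate the kind of check a file-backed store performs and are not copied from LangChain's implementation.

```python
import hashlib
import re

# Assumption: an allow-list similar to what a file-backed store might enforce.
SAFE_KEY = re.compile(r"^[a-zA-Z0-9_.\-/]+$")


def is_safe_key(key: str) -> bool:
    """Return True if the key only contains characters from the allow-list."""
    return SAFE_KEY.match(key) is not None


def hashed_key(text: str, namespace: str = "") -> str:
    """Hypothetical helper: derive a hex key from the text (SHA-256 as a stand-in)."""
    return namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()


texts = [
    "This text is invalid for a key",
    "このテキストはキーになりません",
]

# Raw document text fails the check (spaces, non-ASCII characters).
raw_ok = [is_safe_key(t) for t in texts]

# A hex digest of the same text only contains [0-9a-f], so it passes.
hashed_ok = [is_safe_key(hashed_key(t)) for t in texts]

print(raw_ok, hashed_ok)
```

If the text were hashed before reaching the store, as the documentation describes, neither document in the example above would trigger the key check.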

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103
Python Version: 3.12.4 (main, Sep 13 2024, 16:08:42) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.3.33
langchain: 0.3.17
langchain_community: 0.3.16
langsmith: 0.3.2
langchain_openai: 0.3.2
langchain_text_splitters: 0.3.5
langgraph_sdk: 0.1.51
langgraph_test: Installed. No version info available.

Optional packages not installed

langserve

Other Dependencies

aiohttp: 3.11.11
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.28.1
httpx-sse: 0.4.0
jsonpatch: 1.33
langsmith-pyo3: Installed. No version info available.
numpy: 2.2.0
openai: 1.60.2
orjson: 3.10.15
packaging: 24.2
pydantic: 2.10.3
pydantic-settings: 2.7.1
pytest: 8.3.4
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
rich: 13.9.4
SQLAlchemy: 2.0.37
tenacity: 9.0.0
tiktoken: 0.8.0
typing-extensions: 4.12.2
zstandard: 0.23.0

@langcarl langcarl bot added the investigate Flagged for investigation. label Jan 30, 2025
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jan 30, 2025

lyger commented Jan 30, 2025

I've figured out that you're supposed to use CacheBackedEmbeddings.from_bytes_store instead of initializing directly, but I'm leaving this open because it feels like that should be explained clearly in the main class documentation.
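
The difference between the two construction paths can be sketched as a store wrapper that encodes every key before the underlying store sees it. This is a standard-library toy under stated assumptions: `StrictFileStore`, `KeyEncodingStore`, and `sha256_key` are hypothetical names, and whether `from_bytes_store` works exactly this way internally is an assumption, not something confirmed in this issue.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Tuple


class StrictFileStore:
    """Toy file store that rejects keys outside a small allow-list."""

    _allowed = set(
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_.-"
    )

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        if not key or any(ch not in self._allowed for ch in key):
            raise ValueError(f"Invalid characters in key: {key}")
        return self.root / key

    def mset(self, pairs: Sequence[Tuple[str, bytes]]) -> None:
        for key, value in pairs:
            self._path(key).write_bytes(value)

    def mget(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        results: List[Optional[bytes]] = []
        for key in keys:
            path = self._path(key)
            results.append(path.read_bytes() if path.exists() else None)
        return results


class KeyEncodingStore:
    """Wrap a store so that every key is encoded before the inner store sees it."""

    def __init__(self, inner: StrictFileStore, encode: Callable[[str], str]) -> None:
        self.inner = inner
        self.encode = encode

    def mset(self, pairs: Sequence[Tuple[str, bytes]]) -> None:
        self.inner.mset([(self.encode(k), v) for k, v in pairs])

    def mget(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        return self.inner.mget([self.encode(k) for k in keys])


def sha256_key(text: str) -> str:
    """Stand-in key encoder; the real hashing scheme may differ."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


strict = StrictFileStore(tempfile.mkdtemp())
wrapped = KeyEncodingStore(strict, sha256_key)
text = "This text is invalid for a key"

# Passing the store directly: raw text reaches the key check and fails.
try:
    strict.mget([text])
    direct_failed = False
except ValueError:
    direct_failed = True

# Going through the wrapper: keys are hashed first, so both calls succeed.
wrapped.mset([(text, b"[0.1, 0.2]")])
cached = wrapped.mget([text, "このテキストはキーになりません"])
```

In this sketch the unwrapped store fails on the first lookup, while the wrapped store returns the cached bytes for the first text and `None` for the second, which has not been stored yet.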
