PyMuPDF changes from 0.3.15 mess with existing Vectorstores #29470

Open
5 tasks done
KarlCJ opened this issue Jan 29, 2025 · 1 comment
Labels
investigate Flagged for investigation. Ɑ: vector store Related to vector store module

Comments

KarlCJ commented Jan 29, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

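from langchain.indexes import index
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_experimental.text_splitter import SemanticChunker

# file_path, embeddings, settings, self.store and self.record_manager
# come from the surrounding application code.
docs = PyMuPDFLoader(file_path=file_path, extract_images=False).load()

documents = SemanticChunker(
    breakpoint_threshold_type="gradient",
    embeddings=embeddings,
    breakpoint_threshold_amount=settings.chunking_breakpoint_threshold_amount,
    min_chunk_size=settings.min_chunk_size,
).split_documents([doc for doc in docs if doc.page_content])

index(
    docs_source=documents,
    vector_store=self.store,
    record_manager=self.record_manager,
    cleanup=None,
)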

Error Message and Stack Trace (if applicable)

ERROR:langchain_weaviate.vectorstores:indexer-0:Failed to add object: None
Reason: WeaviateInsertManyAllFailedError("Every object failed during insertion. Here is the set of all errors: no such prop with name 'creationdate' found in class 'Default_index_openai_text_embedding_3_small' in the schema. Check your schema files for which properties in this class are available\nno such prop with name 'moddate' found in class 'Default_index_openai_text_embedding_3_small' in the schema. Check your schema files for which properties in this class are available")

Description

Starting with langchain_community 0.3.15, the metadata keys produced by PyMuPDFLoader changed from modDate and creationDate to moddate and creationdate.
This breaks vector stores whose schema was created with the old keys (in my case Weaviate); see the error above.
As a result, the index() function from langchain.indexes fails without raising an exception. Its return value reports that everything was indexed correctly, which is not the case. The error is only visible in the log and cannot be caught.
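
Until the failure is surfaced properly, one way to notice it from calling code is to watch the logger that emits the error. The sketch below is only an illustration using the standard logging module: the helper name raise_on_logged_errors is made up, and the logger name "langchain_weaviate.vectorstores" is an assumption taken from the log line above.

```python
import logging
from contextlib import contextmanager
from typing import Iterator, List


class _ErrorCollector(logging.Handler):
    """Logging handler that stores every ERROR (or higher) record it sees."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


@contextmanager
def raise_on_logged_errors(
    # Assumption: logger name inferred from the log output in this issue.
    logger_name: str = "langchain_weaviate.vectorstores",
) -> Iterator[None]:
    """Hypothetical helper: raise if the named logger logged an error."""
    logger = logging.getLogger(logger_name)
    collector = _ErrorCollector()
    logger.addHandler(collector)
    try:
        yield
    finally:
        logger.removeHandler(collector)
    # Only reached when the wrapped block itself did not raise.
    if collector.records:
        messages = "; ".join(r.getMessage() for r in collector.records)
        raise RuntimeError(
            f"{len(collector.records)} error(s) logged during indexing: {messages}"
        )
```

Wrapping the index(...) call in `with raise_on_logged_errors():` would then turn the logged insertion failure into an exception. This depends on the integration continuing to log at ERROR level under that logger name, so it is a stopgap rather than a fix.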

For my project I worked around this by mapping the new metadata keys back to the old ones, because Weaviate refuses to add the new keys to an already existing schema.
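
A minimal sketch of that kind of mapping, not the exact code from my project: the function name to_legacy_metadata and the LEGACY_KEY_MAP table are illustrative, and only the two keys named in the error are covered. Other keys may have changed as well and would need to be added to the table.

```python
from typing import Any, Dict

# Assumption: only the two keys from the Weaviate error are remapped here.
LEGACY_KEY_MAP: Dict[str, str] = {
    "creationdate": "creationDate",
    "moddate": "modDate",
}


def to_legacy_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metadata` with new-style keys renamed to the old names.

    Keys that are not in LEGACY_KEY_MAP are copied unchanged.
    """
    return {LEGACY_KEY_MAP.get(key, key): value for key, value in metadata.items()}


# Intended use, applied to the loaded documents before chunking and indexing:
#     for doc in docs:
#         doc.metadata = to_legacy_metadata(doc.metadata)
```

Renaming before SemanticChunker runs keeps the chunk metadata consistent with the schema that was created under langchain_community <= 0.3.14.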

To reproduce: create a vector store and index a PDF (loaded with PyMuPDFLoader) using langchain_community <= 0.3.14, then update to langchain_community >= 0.3.15 and try to index a PDF the same way.

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Wed Jan 22 13:59:07 UTC 2025
Python Version: 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0]

Package Information

langchain_core: 0.3.31
langchain: 0.3.15
langchain_community: 0.3.16
langsmith: 0.2.11
langchain_aws: 0.2.11
langchain_experimental: 0.3.4
langchain_ollama: 0.2.2
langchain_openai: 0.3.2
langchain_text_splitters: 0.3.5
langchain_weaviate: 0.0.3

Optional packages not installed

langserve

Other Dependencies

aiohttp: 3.11.11
async-timeout: Installed. No version info available.
boto3: 1.36.6
dataclasses-json: 0.6.7
httpx: 0.28.1
httpx-sse: 0.4.0
jsonpatch: 1.33
langsmith-pyo3: Installed. No version info available.
numpy: 1.26.4
ollama: 0.4.7
openai: 1.60.1
orjson: 3.10.15
packaging: 24.2
pydantic: 2.10.6
pydantic-settings: 2.7.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
simsimd: 4.4.0
SQLAlchemy: 2.0.37
tenacity: 9.0.0
tiktoken: 0.8.0
typing-extensions: 4.12.2
weaviate-client: 4.10.4
zstandard: Installed. No version info available.

@langcarl langcarl bot added the investigate Flagged for investigation. label Jan 29, 2025
@dosubot dosubot bot added the Ɑ: vector store Related to vector store module label Jan 29, 2025

eyurtsev commented Jan 31, 2025

cc @pprados could you take a look?


Can we add a flag to preserve old behavior? (e.g., metadata_format="legacy")

With the changes from langchain_community 0.3.15 the metadata keys have changed from modDate and creationDate to moddate and creationdate.
