
Python Implementation Incompatible with Node.js Document Storage prefix path #699

Open
Seigneurhol opened this issue Jan 17, 2025 · 1 comment

Comments

@Seigneurhol

Description

Looking at the error logs and your Node.js implementation, I noticed a key difference in how documents are stored and retrieved. The Python implementation assumes documents are stored under a "documents/" prefix, while the Node.js implementation appears to store them directly at the root of the bucket.
This mismatch causes errors when the Python implementation attempts to retrieve documents. Specifically, the Python implementation looks for documents at paths like:

documents/49689d38e7eb

whereas the Node.js implementation stores them as:

49689d38e7eb
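
To make the mismatch concrete, here is a minimal sketch of how a blob name could be derived from a prefix and a document id. The helper `blob_name` is hypothetical and written only for illustration; it follows the `{prefix}/{id}` naming described in this issue and is not copied from the library.

```python
from typing import Optional


def blob_name(doc_id: str, prefix: Optional[str] = "documents") -> str:
    """Hypothetical helper: build a blob name as {prefix}/{id}.

    With an empty or None prefix, the document id is used as-is, which is
    the layout the Node.js side appears to use according to this issue.
    """
    if prefix:
        return f"{prefix}/{doc_id}"
    return doc_id


# Path the Python side would look up with the default prefix.
python_path = blob_name("49689d38e7eb")
# Path that matches a blob stored at the bucket root.
root_path = blob_name("49689d38e7eb", prefix=None)
```

With the default prefix the lookup targets a different blob name than the one written at the bucket root, so the document is not found.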

Suggested Fix

Update the Python implementation to match the Node.js storage pattern by removing the "documents/" prefix:

class GCSDocumentStorage(DocumentStorage):
    """Stores documents in Google Cloud Storage.
    For each pair id, document_text the name of the blob will be {prefix}/{id} stored
    in plain text format.
    """

    def __init__(
        self,
        bucket: storage.Bucket,
        prefix: Optional[str] = "documents", # Remove "documents" here
        threaded=True,
        n_threads=8,
    ) -> None:
        ...  # rest of __init__ unchanged

Alternatively, it would help to be able to pass a GCSDocumentStorage instance instead of the bucket name.

Let me know if you need additional details or logs to debug this further.
Thank you!

@lkuligin
Collaborator

Could you please add a link to the Node.js implementation you mentioned?

You can still instantiate VectorSearchVectorStore directly and change the prefix, since GCSDocumentStorage has a prefix argument. We could also add a prefix argument to the from_components method and pass it through to GCSDocumentStorage.
So that's the second option you suggested. Please feel free to send a PR.
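
As a rough illustration of the proposal above, the sketch below shows a factory that forwards a prefix argument to the document storage. Everything here is a stand-in: `FakeBucket`, `SketchDocumentStorage`, and `from_components_sketch` are hypothetical names used so the example runs without Google Cloud credentials, and they are not the library's classes or the real from_components signature.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeBucket:
    """In-memory stand-in for a GCS bucket (hypothetical, illustration only)."""

    blobs: Dict[str, str] = field(default_factory=dict)


class SketchDocumentStorage:
    """Simplified stand-in for GCSDocumentStorage; not the library class."""

    def __init__(self, bucket: FakeBucket, prefix: Optional[str] = "documents") -> None:
        self._bucket = bucket
        self._prefix = prefix

    def _blob_name(self, doc_id: str) -> str:
        # {prefix}/{id} when a prefix is set, otherwise the bare id.
        return f"{self._prefix}/{doc_id}" if self._prefix else doc_id

    def get(self, doc_id: str) -> Optional[str]:
        return self._bucket.blobs.get(self._blob_name(doc_id))


def from_components_sketch(
    bucket: FakeBucket, prefix: Optional[str] = "documents"
) -> SketchDocumentStorage:
    """Hypothetical factory that forwards `prefix` to the storage."""
    return SketchDocumentStorage(bucket, prefix=prefix)


# A blob stored at the bucket root, as the Node.js side is reported to do.
bucket = FakeBucket({"49689d38e7eb": "text written by the Node.js side"})
default_store = from_components_sketch(bucket)
root_store = from_components_sketch(bucket, prefix=None)
```

With the default prefix the lookup misses the root-level blob, while passing `prefix=None` finds it. Keeping "documents" as the default would leave existing Python-only deployments unaffected.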
