Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of multidoc insertion errors for vector store #110

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

hemidactylus
Copy link
Collaborator

@hemidactylus hemidactylus commented Jan 31, 2025

This PR alleviates a problem with the kind of error that is raised under different failure scenarios during AstraDBVectorStore insertions.

(Do not mind the CI failing. There is some bug with the CI actions related to secrets not trickling down to the test workflows. I ran the tests locally and everything related went smooth.) <== Nevermind, those CI problems somehow have vanished (by themselves?).

Current problems:

the insertion is attempted assuming new document _ids. If it fails, the (detailed) errors from astrapy are inspected.
(note: "errors", plural, since a single insertion can result in a number of errors from several documents).

  1. If they are all of type "id already exists", then the flow switches to running a bunch of find-one-and-update and everything works
  2. if there are (also) other errors from the Data API, an exception is surfaced to the user.

Now, this exception right now is the one received from Astrapy, a fact which has two flaws:

  • It includes the "doc already exists" bits (that the V.Store in reality could handle if they were the only problem)
  • Its string representation, in astrapy 1.x, just lists the first of the possibly many errors. (this is fixed in astrapy 2.0 but that's another story)

These 2 issues can combine leading to a scenario where all the user sees is a "doc already exists", adding to the confusion.

Also it can be argued that it is leaky, in this case, to expose an astrapy exception as is to the user.

What this PR does

The change is for case 2 above.

  • Now a ValueError is raised (and not a astrapy.exception.InsertManyException anymore - slightly breaking for try-catch code)
  • The text message of this ValueError is a concatenation of all errors from the Data API triggered by the insertion, with the "already exists" filtered out.

Below is a Python script exemplifying various insertion scenarios and the before- and after- kind of exception a user would get.

Note:

Right now, all of the errors end up in the error message, even if they are very many. I wonder if it's better to put a upper limit to this and avoid extremely-long strings ending in logs etc, with little added value over a cap of, say, 20 errors.

Demo script (before-and-after)

"""
This script demonstrates the changes this PR brings to the insertion exception logic.

The vectorstore starts empty and then runs 5 add_documents calls, with
new-id and preexisting-id, with faulty documents thrown in, as shown in the diagram
a few lines below.

For the relevant insertions, the class and message of the exception
is given in comments in this code.

I think this change fixes the shortcomings about
truncated/irrelevant error messages.

A question remains: shall the all-error description be given an upper limit to the number
of errors it concatenates (20 or so) ? It could become a *really* long string otherwise
(and if the insertion fails 20+ times, there's probably little point in notifying the user
about all of them...)
"""

import json
import os

from langchain_core.embeddings import Embeddings
from langchain_core.documents import Document
from langchain_astradb import AstraDBVectorStore


class ParserEmbeddings(Embeddings):
    """Parse input texts: if they are json for a List[float], fine.
    Otherwise, return all zeros and call it a day.
    """

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(txt) for txt in texts]

    async def aembed_documents(self, texts: list[str]) -> list[list[float]]:
        return self.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        try:
            vals = json.loads(text)
        except json.JSONDecodeError:
            return [0.0] * self.dimension
        else:
            assert len(vals) == self.dimension
            return vals

    async def aembed_query(self, text: str) -> list[float]:
        return self.embed_query(text)

"""
Insertion rounds, plan:

_id    A     B     C     D     E
 00    I     .     .     .     .
 01    I     .     .     .     .
 02    I     .     .     .     .
 03    I     .     .     .     .
 04    I     .     .     .     .
 05    I     O     .     .     .
 06    I     O     .     .     .
 07    I     O     .     .     .
 08    I     O     .     .     .
 09    I     O     .     .     .
 10    .     I     .     .     .
 11    .     I     .     .     .
 12    .     I     .     .     .
 13    .     I     .     .     .
 14    .     I     O     .     .
 15    .     .     F     .     .
 16    .     .     I     .     .
 17    .     .     I     .     .
 18    .     .     I     .     .
 19    .     .     I     .     .
 20    .     .     I     F     .
 21    .     .     I     F     .
 22    .     .     I     .     F
 23    .     .     I     .     .
 24    .     .     I     .     .

    I = insert
    O = overwrite (insert on preexisting ID)
    F = faulty doc supplied
"""

if __name__ == "__main__":
    embe = ParserEmbeddings(2)

    vstore = AstraDBVectorStore(
        collection_name="errmessages",
        embedding=embe,
        token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
        api_endpoint=os.environ["ASTRA_DB_API_ENDPOINT"],
        batch_size=5,
    )
    vstore.clear()

    # A: insertion ok
    vstore.add_documents([
        Document(page_content=f"[1,{di}]", metadata={"a": di, "gen": "A"}, id=f"{di:02}")
        for di in range(10)
    ])

    # B: insertion with dupes
    vstore.add_documents([
        Document(page_content=f"[1,{di}]", metadata={"a": di, "gen": "B"}, id=f"{di:02}")
        for di in range(5, 15)
    ])

    # C: insertion with mixed errors
    try:
        vstore.add_documents(
            [
                # a puny overwrite
                Document(page_content="[1,14]", metadata={"a": 14, "gen": "C"}, id="14"),
            ] + [
                # new ones - but with a faulty doc in the middle
                Document(page_content="[1,15]", metadata={"a": 15, "gen": "C"}, id="15"),
                Document(page_content="[1,16]", metadata={"a": 16, "gen": "C", "$trouble?": "yes"}, id="16"),
            ] + [
                Document(page_content=f"[1,{di}]", metadata={"a": di, "gen": "C"}, id=f"{di:02}")
                for di in range(17, 25)
            ]
        )
    except Exception as err:
        print("\nC => ERROR")
        print(f"    str(err):  {str(err)}")
        print(f"    type(err): {type(err)}")
        print(f"    err:       {err}")

    """
    Current output is incomplete and misleading.
    (the mentioned _id does not get overwritten in the collection, but the faulty one is not reported!)

        C => ERROR
            str(err):  Failed to insert document with _id 14: Document already exists with the given _id
            type(err): <class 'astrapy.exceptions.InsertManyException'>
            err:       Failed to insert document with _id 14: Document already exists with the given _id

    This PR:

        C => ERROR
            str(err):  Cannot insert documents. The Data API returned the following error(s): Failed to insert document with _id 16: Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)
            type(err): <class 'ValueError'>
            err:       Cannot insert documents. The Data API returned the following error(s): Failed to insert document with _id 16: Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)
    """

    # D: insertion with only errors
    try:
        vstore.add_documents(
            [
                Document(page_content="[1,20]", metadata={"a": 20, "gen": "D", "$trouble?": "yes"}, id="20"),
                Document(page_content="[1,21]", metadata={"a": 21, "gen": "D", "superlist": list(range(1001))}, id="21"),
            ]
        )
    except Exception as err:
        print("\nD => ERROR")
        print(f"    str(err):  {str(err)}")
        print(f"    type(err): {type(err)}")
        print(f"    err:       {err}")
    """
    Current output is incomplete: only the first error is mentioned.

        D => ERROR
            str(err):  Failed to insert document with _id 20: Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)
            type(err): <class 'astrapy.exceptions.InsertManyException'>
            err:       Failed to insert document with _id 20: Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)

    This PR:

        D => ERROR
            str(err):  Cannot insert documents. The Data API returned the following error(s): Failed to insert document with _id 20: Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-); Failed to insert document with _id 21: Document size limitation violated: number of elements an indexable Array (property 'superlist') has (1001) exceeds maximum allowed (1000)
            type(err): <class 'ValueError'>
            err:       Cannot insert documents. The Data API returned the following error(s): Failed to insert document with _id 20: Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-); Failed to insert document with _id 21: Document size limitation violated: number of elements an indexable Array (property 'superlist') has (1001) exceeds maximum allowed (1000)
    """

    # E: *single* insertion with an error
    try:
        vstore.add_documents(
            [
                Document(page_content="[1,22]", metadata={"a": 22, "gen": "D", "$trouble?": "yes"}, id="22"),
            ]
        )
    except Exception as err:
        print("\nE => ERROR")
        print(f"    str(err):  {str(err)}")
        print(f"    type(err): {type(err)}")
        print(f"    err:       {err}")
    """
    Current output is lacking the 'failed to insert document with id' preamble:

            E => ERROR
                str(err):  Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)
                type(err): <class 'astrapy.exceptions.InsertManyException'>
                err:       Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)

    **NOTE**: The lack of _id in this case is a Data API behaviour (filed an issue: https://github.com/stargate/data-api/issues/1840). Should not ever be a real problem in this context.

    This PR:

        E => ERROR
            str(err):  Cannot insert documents. The Data API returned the following error(s): Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)
            type(err): <class 'ValueError'>
            err:       Cannot insert documents. The Data API returned the following error(s): Document field name invalid: field name ('$trouble?') contains invalid character(s), can contain only letters (a-z/A-Z), numbers (0-9), underscores (_), and hyphens (-)
    """

@hemidactylus hemidactylus requested a review from cbornet January 31, 2025 16:03
@hemidactylus
Copy link
Collaborator Author

Huh? The integration tests are able to run again? Well that's definitely good news (though I wonder what was the cause for the secrets not passing through in the first place).

@hemidactylus hemidactylus marked this pull request as draft January 31, 2025 17:18
@hemidactylus
Copy link
Collaborator Author

Added a cap (currently 8) on the amount of errors ending up in the message string.
Now it may end (if necessary) with something like:

[...] (Note: 4 further errors omitted.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant