
Issue in retrieval after adding new data in langchain-chroma vectordb #29499

Open
Shabbir-iRoidSolutions opened this issue Jan 30, 2025 · 0 comments

Labels
Ɑ: vector store Related to vector store module
Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

# ----------- code to store data in vectordb ----------------
ext_to_loader = {
    '.csv': CSVLoader,
    '.json': JSONLoader,
    '.txt': TextLoader,
    '.pdf': PDFPlumberLoader,
    '.docx': Docx2txtLoader,
    '.pptx': PPTXLoader,
    '.xlsx': ExcelLoader,
    '.xls': ExcelLoader,
    'single_page_url': WebBaseLoader,
    'all_urls_from_base_url': RecursiveUrlLoader,
    'directory': DirectoryLoader
}

def get_loader_for_extension(file_path):
    _, ext = os.path.splitext(file_path)
    loader_class = ext_to_loader.get(ext.lower())
    if loader_class:
        return loader_class(file_path)
    else:
        print(f"Unsupported file extension: {ext}")
        return None

def normalize_documents(docs):
    normalized = []
    for doc in docs:
        if isinstance(doc.page_content, str):
            normalized.append(doc.page_content)
        elif isinstance(doc.page_content, list):
            normalized.append('\n'.join(doc.page_content))
        else:
            normalized.append('')
    return normalized

def vectorestore_function(split_documents_with_metadata, user_vector_store_path):
    try:
        # Create vector store with metadata
        embeddings = OpenAIEmbeddings(
            model="text-embedding-ada-002",
            openai_api_key=OPENAI_API_KEY
        )

        vector_store = Chroma(
            embedding_function=embeddings,
            persist_directory=user_vector_store_path
        )

        vector_store.add_documents(documents=split_documents_with_metadata)

        return vector_store
    except Exception as e:
        print(f'Error in vectorestore_function: {str(e)}')

loader = get_loader_for_extension(saved_file_path)
docs = loader.load()
normalized_docs = normalize_documents(docs)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)
split_docs = text_splitter.create_documents(normalized_docs)

split_documents_with_metadata = [
    Document(page_content=document.page_content, metadata={"user_id": user_id, "doc_id": document_id})
    for document in split_docs
]
vectorestore_function(
    split_documents_with_metadata, 
    user_vector_store_path
)
# Note: I use the same code above to add or update data.


# ----------- code for interaction with AI -----------
def get_vector_store(user_vector_store_path):
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=OPENAI_API_KEY
    )
    vectorstore = Chroma(
        embedding_function=embeddings,
        persist_directory=user_vector_store_path
    )
    return vectorstore
document_id_list = [str(document_id) if isinstance(document_id, int) else document_id for document_id in document_id_list]

user_vector_store_path = os.path.join(VECTOR_STORE_PATH, user_id)        
vectorstore = get_vector_store(user_vector_store_path)

retriever = vectorstore.as_retriever()

current_threshold = 0.25
retrieved_docs = []  # initialize so the print below cannot raise NameError if invoke fails
try:
    # Configure filtering
    retriever.search_type = "similarity_score_threshold"
    retriever.search_kwargs = {
        "filter": {
            "$and": [
                {"user_id": user_id},
                {"doc_id": {"$in": document_id_list}}
            ]
        },
        "score_threshold": current_threshold,
        "k": 3
    }

    retrieved_docs = retriever.invoke(question)
except Exception as e:
    print(f'error: {str(e)}')

print(f"retrieved_docs: {retrieved_docs}")


if not retrieved_docs:
    return jsonify({'error': 'No relevant docs were retrieved.'}), 404
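As an aside, the same filter can be built once and passed to `as_retriever(...)` up front instead of mutating the retriever's attributes after creation. A minimal, pure-Python sketch of the filter construction (the `build_doc_filter` helper and the sample arguments are illustrative, not from the original code):

```python
# Hypothetical helper: builds the same Chroma-style metadata filter used above.
def build_doc_filter(user_id, document_id_list):
    # Coerce integer ids to strings, mirroring the original list comprehension.
    ids = [str(d) if isinstance(d, int) else d for d in document_id_list]
    return {
        "$and": [
            {"user_id": user_id},
            {"doc_id": {"$in": ids}},
        ]
    }

search_kwargs = {
    "filter": build_doc_filter("user_42", [7, "abc"]),
    "score_threshold": 0.25,
    "k": 3,
}
```

With LangChain's documented API, this would then be supplied as `vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs=search_kwargs)` rather than assigning `retriever.search_type` and `retriever.search_kwargs` after the fact.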

Error Message and Stack Trace (if applicable)

WARNING:langchain_core.vectorstores.base:No relevant docs were retrieved using the relevance score threshold 0.25

Description

I’m facing an issue with my live server. When a new user is created, a new vector database is generated, and everything works fine. If I add more data, it gets stored in the vector database, but I’m unable to retrieve the newly added data.

Interestingly, this issue does not occur in my local environment; it only happens on the live server. To make the new data retrievable, I have to execute pm2 reload "id", as my application is running with PM2. However, if another user is in the middle of a conversation when I reload PM2, the socket connection gets disconnected, disrupting their session.

Tech Stack:
  • Flutter – mobile application
  • Node.js – back office
  • Python – data extraction, vector database creation, and conversations
  • Celery – file download, embedding creation, and vector database updates
  • Apache serves the application; PM2 manages the application process.

Issue:
  • New data is added to the vector database but cannot be retrieved until pm2 reload "id" is executed.
  • Reloading PM2 disconnects active socket connections, affecting ongoing user conversations.

What I Want to Achieve:
I want to ensure that the system works seamlessly when a user adds or updates data in the vector database. The new data should be immediately accessible for conversations without requiring a PM2 reload.
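One pattern that can produce exactly this "stale until reload" symptom is a long-lived worker caching a vector-store handle that was opened before another process wrote the new data; a fresh handle per request reads the current on-disk state. A minimal, pure-Python sketch of the two policies, with `open_store` standing in for the real `Chroma(...)` constructor (all names here are illustrative, not from the original code):

```python
def make_store_getter(open_store, cache_handles=False):
    """Return a get_store(path) function.

    cache_handles=False: open a fresh handle on every call, so it always
    reflects the latest persisted data.
    cache_handles=True: reuse one handle per path, which may keep serving a
    stale snapshot if another process wrote to the store after it was opened.
    """
    cache = {}

    def get_store(path):
        if not cache_handles:
            return open_store(path)
        if path not in cache:
            cache[path] = open_store(path)
        return cache[path]

    return get_store


# Demonstration with a counter standing in for the real store constructor.
opened = []

def open_store(path):
    opened.append(path)
    return f"handle-{len(opened)}"

fresh = make_store_getter(open_store, cache_handles=False)
fresh("db"); fresh("db")    # two distinct handles opened
cached = make_store_getter(open_store, cache_handles=True)
cached("db"); cached("db")  # one shared handle opened
```

If the live server's conversation worker caches the Chroma object (or is forked before the store is updated) while the local setup re-creates it per request, that difference alone would explain why only the live server needs a PM2 reload.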

In the back office, I am using Socket.IO to send status updates:

socketio.emit('status', {'message': {
    "user_id": user_id,
    "document_id": document_id,
    "status": 200,
    "message": f"Document ID {document_id} processed successfully."
}}, room=room)

This message is successfully emitted, and users can start conversations after receiving it. However, I’m still facing the issue where newly added data is not retrievable until I reload PM2.

Question:
How can I ensure that the system updates the vector database dynamically without requiring a PM2 reload, while keeping active socket connections intact?

System Info

-------------------------------------------------- live server:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7571
CPU family: 23
Model: 1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 2
BogoMIPS: 4399.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 64 KiB (1 instance)
L2: 512 KiB (1 instance)
L3: 8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected

-------------------------------------------------- pip list:

Package Version


aiohappyeyeballs 2.4.4
aiohttp 3.11.11
aiosignal 1.3.2
amqp 5.3.1
annotated-types 0.7.0
anyio 4.8.0
asgiref 3.8.1
async-timeout 4.0.3
attrs 25.1.0
backoff 2.2.1
bcrypt 4.2.1
beautifulsoup4 4.12.3
bidict 0.23.1
billiard 4.2.1
blinker 1.9.0
build 1.2.2.post1
cachetools 5.5.1
celery 5.4.0
certifi 2024.12.14
cffi 1.17.1
charset-normalizer 3.4.1
chroma-hnswlib 0.7.6
chromadb 0.5.23
click 8.1.8
click-didyoumean 0.3.1
click-plugins 1.1.1
click-repl 0.3.0
colorama 0.4.6
coloredlogs 15.0.1
cryptography 44.0.0
dataclasses-json 0.6.7
Deprecated 1.2.17
distro 1.9.0
dnspython 2.7.0
docx2txt 0.8
durationpy 0.9
et_xmlfile 2.0.0
eventlet 0.39.0
exceptiongroup 1.2.2
fastapi 0.115.7
filelock 3.17.0
Flask 3.1.0
Flask-Cors 5.0.0
Flask-SocketIO 5.5.1
flatbuffers 25.1.24
frozenlist 1.5.0
fsspec 2024.12.0
google-auth 2.38.0
googleapis-common-protos 1.66.0
greenlet 3.1.1
grpcio 1.70.0
h11 0.14.0
httpcore 1.0.7
httptools 0.6.4
httpx 0.28.1
httpx-sse 0.4.0
huggingface-hub 0.27.1
humanfriendly 10.0
idna 3.10
importlib_metadata 8.5.0
importlib_resources 6.5.2
itsdangerous 2.2.0
Jinja2 3.1.5
jiter 0.8.2
jsonpatch 1.33
jsonpointer 3.0.0
kombu 5.4.2
kubernetes 32.0.0
langchain 0.3.15
langchain-chroma 0.2.0
langchain-community 0.3.15
langchain-core 0.3.31
langchain-openai 0.3.2
langchain-text-splitters 0.3.5
langsmith 0.3.1
lxml 5.3.0
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.0
mdurl 0.1.2
mmh3 5.1.0
monotonic 1.6
mpmath 1.3.0
multidict 6.1.0
mypy-extensions 1.0.0
numpy 1.26.4
oauthlib 3.2.2
onnxruntime 1.20.1
openai 1.60.1
openpyxl 3.1.5
opentelemetry-api 1.29.0
opentelemetry-exporter-otlp-proto-common 1.29.0
opentelemetry-exporter-otlp-proto-grpc 1.29.0
opentelemetry-instrumentation 0.50b0
opentelemetry-instrumentation-asgi 0.50b0
opentelemetry-instrumentation-fastapi 0.50b0
opentelemetry-proto 1.29.0
opentelemetry-sdk 1.29.0
opentelemetry-semantic-conventions 0.50b0
opentelemetry-util-http 0.50b0
orjson 3.10.15
overrides 7.7.0
packaging 24.2
pandas 2.2.3
pdf2image 1.17.0
pdfminer.six 20231228
pdfplumber 0.11.5
pillow 11.1.0
pip 22.0.2
posthog 3.10.0
prompt_toolkit 3.0.50
propcache 0.2.1
protobuf 5.29.3
pyasn1 0.6.1
pyasn1_modules 0.4.1
pycparser 2.22
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.7.1
Pygments 2.19.1
PyMySQL 1.1.1
pyOpenSSL 25.0.0
pypdfium2 4.30.1
PyPika 0.48.9
pyproject_hooks 1.2.0
pyreadline3 3.5.4
pytesseract 0.3.13
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-engineio 4.11.2
python-pptx 1.0.2
python-socketio 5.12.1
pytz 2024.2
PyYAML 6.0.2
redis 5.2.1
regex 2024.11.6
requests 2.32.3
requests-oauthlib 2.0.0
requests-toolbelt 1.0.0
rich 13.9.4
rsa 4.9
setuptools 59.6.0
shellingham 1.5.4
simple-websocket 1.1.0
six 1.17.0
sniffio 1.3.1
soupsieve 2.6
SQLAlchemy 2.0.37
starlette 0.45.3
sympy 1.13.3
tenacity 9.0.0
tiktoken 0.8.0
tokenizers 0.20.3
tomli 2.2.1
tqdm 4.67.1
typer 0.15.1
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
urllib3 2.3.0
uvicorn 0.34.0
uvloop 0.21.0
vine 5.1.0
watchfiles 1.0.4
wcwidth 0.2.13
websocket-client 1.8.0
websockets 14.2
Werkzeug 3.1.3
wrapt 1.17.2
wsproto 1.2.0
xlrd 2.0.1
XlsxWriter 3.2.1
yarl 1.18.3
zipp 3.21.0
zstandard 0.23.0

@dosubot dosubot bot added the Ɑ: vector store Related to vector store module label Jan 30, 2025