I recently added a basic retry mechanism to worker.py in this PR. It's a naive implementation of retry: the system retries a batch up to 3 times by putting it back on the embeddings queue.
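For context, a minimal sketch of what that naive approach looks like, assuming a pika/RabbitMQ consumer and a retry counter carried in the message body (`embed_batch` and `mark_batch_failed` are illustrative placeholders, not the project's actual helpers):

```python
import json

MAX_RETRIES = 3

def process_batch(channel, method, properties, body):
    """Consumer callback: try to embed the batch, re-queue on failure."""
    batch = json.loads(body)
    try:
        embed_batch(batch)  # existing embedding logic (placeholder)
    except Exception:
        retries = batch.get("retries", 0)
        if retries < MAX_RETRIES:
            batch["retries"] = retries + 1
            # naive retry: straight back onto the same embeddings queue
            channel.basic_publish(exchange="", routing_key="embeddings",
                                  body=json.dumps(batch))
        else:
            mark_batch_failed(batch)  # placeholder persistence helper
    finally:
        channel.basic_ack(delivery_tag=method.delivery_tag)
```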
What we need to do

1. Create a retry queue for each existing queue.
2. Create a dead letter queue (DLQ) that holds messages that have already been retried 3 times.
3. Create a cron job or scheduled task (sketched after this list) that:
   a) moves messages from the retry queue back to the main queue
   b) queries batches that are more than 24 hours old and marks them as FAILED
   I think this can run once per hour to start.
4. Add logic to hugging_face/app.py and worker/vdb_upload_worker.py that puts failed batches onto the retry queue. Be selective about where and when you choose to do this. If something fails because a key is missing or a connection URL is wrong, it shouldn't be retried; retries probably only make sense for very specific types of exceptions (see the classification sketch at the end of the notes below).
5. Alter the logic in worker.py to use the retry queue. If something has been retried the maximum number of times, add logic to put it onto the DLQ.
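A rough sketch of what the hourly task could look like, again assuming pika/RabbitMQ; the retry-queue names and the `db` helper are assumptions for illustration, not names from the codebase:

```python
# Hourly maintenance task: drain each retry queue back onto its main queue,
# then mark batches older than 24 hours as FAILED.
from datetime import datetime, timedelta, timezone

import pika

QUEUE_PAIRS = [
    ("extraction_retry", "extraction"),
    ("embeddings_retry", "embeddings"),
    ("vdb_upload_retry", "vdb_upload"),
]

def drain_retry_queues(channel):
    for retry_queue, main_queue in QUEUE_PAIRS:
        while True:
            method, _properties, body = channel.basic_get(queue=retry_queue)
            if method is None:  # retry queue is empty
                break
            channel.basic_publish(exchange="", routing_key=main_queue, body=body)
            channel.basic_ack(delivery_tag=method.delivery_tag)

def fail_stale_batches(db):
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    # assumed persistence helper: flips any batch older than the cutoff to FAILED
    db.mark_batches_failed_before(cutoff)

def run_hourly(db):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    try:
        drain_retry_queues(connection.channel())
        fail_stale_batches(db)
    finally:
        connection.close()
```

Whether this runs as a cron entry or a long-lived scheduler loop, each pass should be idempotent: re-running it just re-queues whatever is still sitting in the retry queues.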
Other System Notes
VectorFlow currently has 4 queues:
- extraction — holds pointers to files that will be turned into batches
- embeddings — holds batches that will be turned into chunks and either embedded with OpenAI embeddings or passed to the Hugging Face model queue for embedding
- hugging face model — holds chunks for embedding with a Hugging Face sentence transformer model; here the name of the queue is the name of the model
- vector database upload — holds chunks & vector embeddings that will be uploaded to a vector store
The Hugging Face, VDB upload, and OpenAI embeddings workers all need a retry mechanism.
The queue system could be leveraged for this, with either a general retry queue at each stage or one per individual worker.
There should be logic to prevent retries when critical system components are down (like OpenAI's API or a vector DB's host), since retrying against a dead dependency just churns the queue.
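One way to express the selective-retry rule is a small routing helper shared by the workers. A sketch, where the retryable exception tuple, queue names, and `mark_batch_failed` are all assumptions for illustration:

```python
import json

# Only transient failures are worth retrying; anything else (missing key,
# bad connection URL, malformed payload) should fail immediately.
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionError)
MAX_RETRIES = 3

def route_failure(channel, batch, exc, retry_queue, dlq="dead_letter"):
    if not isinstance(exc, RETRYABLE_EXCEPTIONS):
        mark_batch_failed(batch)  # assumed persistence helper
        return
    retries = batch.get("retries", 0)
    if retries >= MAX_RETRIES:
        # exhausted its retries: park it on the DLQ for manual inspection
        channel.basic_publish(exchange="", routing_key=dlq,
                              body=json.dumps(batch))
    else:
        batch["retries"] = retries + 1
        channel.basic_publish(exchange="", routing_key=retry_queue,
                              body=json.dumps(batch))
```

A circuit-breaker layered on top of this (skip publishing to the retry queue entirely while OpenAI's API or the vector DB host is known to be down) would cover the last note above.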