I recently added a basic retry mechanism to worker.py in this PR. It's a naive implementation of retry: the system retries a batch up to 3 times by putting it back on the embeddings queue.
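For context, a minimal sketch of what that naive approach looks like, assuming a pika/RabbitMQ consumer and a retry counter carried in the message body (`embed_batch` and `mark_batch_failed` are illustrative placeholders, not the project's actual helpers):

```python
import json

MAX_RETRIES = 3

def process_batch(channel, method, properties, body):
    """Consumer callback: try to embed the batch, re-queue on failure."""
    batch = json.loads(body)
    try:
        embed_batch(batch)  # existing embedding logic (placeholder)
    except Exception:
        retries = batch.get("retries", 0)
        if retries < MAX_RETRIES:
            batch["retries"] = retries + 1
            # naive retry: straight back onto the same embeddings queue
            channel.basic_publish(exchange="", routing_key="embeddings",
                                  body=json.dumps(batch))
        else:
            mark_batch_failed(batch)  # placeholder persistence helper
    finally:
        channel.basic_ack(delivery_tag=method.delivery_tag)
```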
What we need to do

1. Create a retry queue for each existing queue.
2. Create a dead letter queue (DLQ) that holds messages that have already been retried 3 times.
3. Create a cron job or scheduled task (sketched after this list) that:
   a) moves messages from the retry queue back to the main queue
   b) queries batches that are more than 24 hours old and marks them as FAILED
   I think this can run once per hour to start.
4. Add logic to hugging_face/app.py and worker/vdb_upload_worker.py that puts failed batches onto the retry queue. Be selective about where and when you choose to do this. If something fails because a key is missing or a connection URL is wrong, it shouldn't be retried; retries probably only make sense for very specific types of exceptions (see the classification sketch at the end of the notes below).
5. Alter the logic in worker.py to use the retry queue. If something has been retried the maximum number of times, add logic to put it onto the DLQ.
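A rough sketch of what the hourly task could look like, again assuming pika/RabbitMQ; the retry-queue names and the `db` helper are assumptions for illustration, not names from the codebase:

```python
# Hourly maintenance task: drain each retry queue back onto its main queue,
# then mark batches older than 24 hours as FAILED.
from datetime import datetime, timedelta, timezone

import pika

QUEUE_PAIRS = [
    ("extraction_retry", "extraction"),
    ("embeddings_retry", "embeddings"),
    ("vdb_upload_retry", "vdb_upload"),
]

def drain_retry_queues(channel):
    for retry_queue, main_queue in QUEUE_PAIRS:
        while True:
            method, _properties, body = channel.basic_get(queue=retry_queue)
            if method is None:  # retry queue is empty
                break
            channel.basic_publish(exchange="", routing_key=main_queue, body=body)
            channel.basic_ack(delivery_tag=method.delivery_tag)

def fail_stale_batches(db):
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    # assumed persistence helper: flips any batch older than the cutoff to FAILED
    db.mark_batches_failed_before(cutoff)

def run_hourly(db):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    try:
        drain_retry_queues(connection.channel())
        fail_stale_batches(db)
    finally:
        connection.close()
```

Whether this runs as a cron entry or a long-lived scheduler loop, each pass should be idempotent: re-running it just re-queues whatever is still sitting in the retry queues.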
Other System Notes
VectorFlow currently has 4 queues:
- extraction — holds pointers to files that will be turned into batches
- embeddings — holds batches that will be turned into chunks and either embedded with OpenAI embeddings or passed to the Hugging Face model queue for embedding
- hugging face model — holds chunks for embedding with a Hugging Face sentence transformer model; here the name of the queue is the name of the model
- vector database upload — holds chunks & vector embeddings that will be uploaded to a vector store
The Hugging Face, VDB upload, and OpenAI embeddings workers all need a retry mechanism.
The queue system could be leveraged for this, with either a general retry queue at each stage or one per individual worker.
There should be logic to prevent retries when critical system components are down (like OpenAI's API or a vector DB's host), since retrying against a dead dependency just churns the queue.
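One way to express the selective-retry rule is a small routing helper shared by the workers. A sketch, where the retryable exception tuple, queue names, and `mark_batch_failed` are all assumptions for illustration:

```python
import json

# Only transient failures are worth retrying; anything else (missing key,
# bad connection URL, malformed payload) should fail immediately.
RETRYABLE_EXCEPTIONS = (TimeoutError, ConnectionError)
MAX_RETRIES = 3

def route_failure(channel, batch, exc, retry_queue, dlq="dead_letter"):
    if not isinstance(exc, RETRYABLE_EXCEPTIONS):
        mark_batch_failed(batch)  # assumed persistence helper
        return
    retries = batch.get("retries", 0)
    if retries >= MAX_RETRIES:
        # exhausted its retries: park it on the DLQ for manual inspection
        channel.basic_publish(exchange="", routing_key=dlq,
                              body=json.dumps(batch))
    else:
        batch["retries"] = retries + 1
        channel.basic_publish(exchange="", routing_key=retry_queue,
                              body=json.dumps(batch))
```

A circuit-breaker layered on top of this (skip publishing to the retry queue entirely while OpenAI's API or the vector DB host is known to be down) would cover the last note above.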