Running multiple models via Python server #119
We would like to run several models using Infinity, but we have found that when initialising two models, only the most recently started one responds to embedding requests. We have seen your guide about running multiple models via the CLI, but we would prefer a Pythonic way of doing this, since we would like to extend this API with more endpoints (sparse models, etc.).

We load the models with a simple class that stores the engines:

```python
from typing import Optional

from infinity_emb import AsyncEmbeddingEngine, EngineArgs


class Memory:
    embeddings_args: Optional[EngineArgs] = None
    reranker_args: Optional[EngineArgs] = None
    embedding_model: Optional[AsyncEmbeddingEngine] = None
    reranker_model: Optional[AsyncEmbeddingEngine] = None

    def __init__(self, embeddings_args, reranker_args):
        self.embeddings_args = embeddings_args
        self.reranker_args = reranker_args

    async def astart(self):
        self.embedding_model = AsyncEmbeddingEngine.from_args(self.embeddings_args)
        await self.embedding_model.astart()
        self.reranker_model = AsyncEmbeddingEngine.from_args(self.reranker_args)
        await self.reranker_model.astart()

    async def ateardown(self):
        # Stop both engines so their resources are released on shutdown.
        await self.embedding_model.astop()
        await self.reranker_model.astop()
```

We initialise the memory, and in the startup lifespan of FastAPI we start both engines:

```python
models = Memory(
    embeddings_args=embeddings_args,
    reranker_args=reranker_args,
)


@asynccontextmanager
async def lifespan(app: FastAPI):
    instrumentator.expose(app)
    # Load the ML models
    await models.astart()
    logger.info(docs.startup_message(host="localhost", port="8000", prefix=""))
    yield
    # Clean up the ML models and release the resources
    await models.ateardown()
```

Any suggestion @michaelfeil? Should we stay with the CLI approach, or is there something we can do to select the models programmatically?
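To make the lifecycle pattern above concrete without depending on Infinity itself, here is a minimal, dependency-free sketch. `DummyEngine` and `EngineRegistry` are hypothetical stand-ins for `AsyncEmbeddingEngine` and the `Memory` class; the point is only that each engine gets its own `astart()`/`astop()` pair, called once at startup and once at shutdown:

```python
import asyncio


class DummyEngine:
    """Hypothetical stand-in for AsyncEmbeddingEngine (same astart/astop lifecycle)."""

    def __init__(self, name: str):
        self.name = name
        self.started = False

    async def astart(self):
        self.started = True

    async def astop(self):
        self.started = False


class EngineRegistry:
    """Holds any number of engines and manages their shared lifecycle."""

    def __init__(self, *engines: DummyEngine):
        self.engines = engines

    async def astart(self):
        # Start every engine once at application startup.
        for engine in self.engines:
            await engine.astart()

    async def ateardown(self):
        # Stop every engine on shutdown so resources are released.
        for engine in self.engines:
            await engine.astop()


async def main():
    models = EngineRegistry(DummyEngine("embedder"), DummyEngine("reranker"))
    await models.astart()
    print([e.started for e in models.engines])  # [True, True]
    await models.ateardown()
    print([e.started for e in models.engines])  # [False, False]


asyncio.run(main())
```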
Replies: 1 comment
Looks good - that's how the two classes are intended to be used. Here is how I have built it myself: https://github.com/michaelfeil/infinity/blob/main/docs/benchmarks/simple_app.py

Additional notes:
- Make sure your `async def ateardown` function uses `await astop()` on both engines.
- Have `hf_transfer` installed for max transfer speed.
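For the `hf_transfer` tip: if the models are downloaded through `huggingface_hub` (as is typical for Hugging Face models), installing the package alone is not enough; the hub only uses it when an environment variable is set. A minimal setup, assuming a standard `huggingface_hub`-based download path, would be:

```shell
# Install the Rust-based downloader used by huggingface_hub
pip install hf_transfer

# huggingface_hub only uses hf_transfer when this variable is set
export HF_HUB_ENABLE_HF_TRANSFER=1
```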