Running multiple models via Python server #119
We would like to run several models using Infinity, but we have found that when initialising two models, only the most recently started one responds to embedding requests. We have seen your guide about running multiple models via the CLI, but we would prefer a Pythonic way of doing this, since we would like to extend this API with more endpoints (sparse models, etc.).

We load the models with a simple class that stores the engines:

```python
from typing import Optional

from infinity_emb import AsyncEmbeddingEngine, EngineArgs


class Memory:
    embeddings_args: Optional[EngineArgs] = None
    reranker_args: Optional[EngineArgs] = None
    embedding_model: Optional[AsyncEmbeddingEngine] = None
    reranker_model: Optional[AsyncEmbeddingEngine] = None

    def __init__(self, embeddings_args, reranker_args):
        self.embeddings_args = embeddings_args
        self.reranker_args = reranker_args

    async def astart(self):
        self.embedding_model = AsyncEmbeddingEngine.from_args(self.embeddings_args)
        await self.embedding_model.astart()
        self.reranker_model = AsyncEmbeddingEngine.from_args(self.reranker_args)
        await self.reranker_model.astart()

    async def ateardown(self):
        # Stop both engines so their resources are released on shutdown.
        await self.embedding_model.astop()
        await self.reranker_model.astop()
```

We initialise the memory, and in the startup lifespan of FastAPI we start both engines:

```python
models = Memory(
    embeddings_args=embeddings_args,
    reranker_args=reranker_args,
)


@asynccontextmanager
async def lifespan(app: FastAPI):
    instrumentator.expose(app)
    # Load the ML models
    await models.astart()
    logger.info(docs.startup_message(host="localhost", port="8000", prefix=""))
    yield
    # Clean up the ML models and release the resources
    await models.ateardown()
```

Any suggestion @michaelfeil? Should we stay with the CLI approach, or is there something we can do to select the models programmatically?
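To make the lifecycle pattern above concrete without depending on Infinity itself, here is a minimal, dependency-free sketch. `DummyEngine` and `EngineRegistry` are hypothetical stand-ins for `AsyncEmbeddingEngine` and the `Memory` class; the point is only that each engine gets its own `astart()`/`astop()` pair, called once at startup and once at shutdown:

```python
import asyncio


class DummyEngine:
    """Hypothetical stand-in for AsyncEmbeddingEngine (same astart/astop lifecycle)."""

    def __init__(self, name: str):
        self.name = name
        self.started = False

    async def astart(self):
        self.started = True

    async def astop(self):
        self.started = False


class EngineRegistry:
    """Holds any number of engines and manages their shared lifecycle."""

    def __init__(self, *engines: DummyEngine):
        self.engines = engines

    async def astart(self):
        # Start every engine once at application startup.
        for engine in self.engines:
            await engine.astart()

    async def ateardown(self):
        # Stop every engine on shutdown so resources are released.
        for engine in self.engines:
            await engine.astop()


async def main():
    models = EngineRegistry(DummyEngine("embedder"), DummyEngine("reranker"))
    await models.astart()
    print([e.started for e in models.engines])  # [True, True]
    await models.ateardown()
    print([e.started for e in models.engines])  # [False, False]


asyncio.run(main())
```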
Replies: 1 comment
Looks good - that's how the two classes are intended to be used. Here is how I have built it myself: https://github.com/michaelfeil/infinity/blob/main/docs/benchmarks/simple_app.py

Additional notes:
- Make sure your `async def ateardown` function uses `await astop()` on both engines.
- Have `hf_transfer` installed for max transfer speed.
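For the `hf_transfer` tip: if the models are downloaded through `huggingface_hub` (as is typical for Hugging Face models), installing the package alone is not enough; the hub only uses it when an environment variable is set. A minimal setup, assuming a standard `huggingface_hub`-based download path, would be:

```shell
# Install the Rust-based downloader used by huggingface_hub
pip install hf_transfer

# huggingface_hub only uses hf_transfer when this variable is set
export HF_HUB_ENABLE_HF_TRANSFER=1
```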