This repository demonstrates LLM inference as an asynchronous stream on FastAPI, using the APIs from OpenAI, NVIDIA NIM, or NAVER HyperCLOVA.
- Prepare the keys for the LLM platforms (NGC, OpenAI, CLOVA Studio).
  - Assign the keys within the `key_config.env` file, then source it:

    ```
    $ source key_config.env
    ```

  - Then check that those keys are properly assigned as environment variables by executing the `env` command:

    ```
    $ env
    SHELL=/bin/bash
    NGC_CLI_API_KEY=xxxxx
    ...
    NVIDIA_API_KEY=nvapi-....
    ...
    ```
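As a quick sanity check before launching the server, a small script along these lines can confirm the keys are visible to Python. Only `NGC_CLI_API_KEY` and `NVIDIA_API_KEY` appear in the output above; the OpenAI and CLOVA Studio variable names below are assumptions for illustration.

```python
import os

# NGC_CLI_API_KEY and NVIDIA_API_KEY appear in the env output above;
# OPENAI_API_KEY and CLOVASTUDIO_API_KEY are assumed names for the
# remaining platforms and may differ in key_config.env.
REQUIRED_KEYS = [
    "NGC_CLI_API_KEY",
    "NVIDIA_API_KEY",
    "OPENAI_API_KEY",
    "CLOVASTUDIO_API_KEY",
]

missing = [name for name in REQUIRED_KEYS if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
print("All API keys are set.")
```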
- Deploy NVIDIA NIM on your server for the self-hosted API.
  - Local deployment of the NIM service requires an NVAIE (NVIDIA AI Enterprise) license.
  - For example, how to deploy NIM for `mistralai/mistral-7b-instruct-v0.3` on your host is described in NGC.
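Once the NIM container is running, it exposes an OpenAI-compatible API, so the self-hosted endpoint can be queried directly. Below is a minimal sketch assuming the default local port 8000; the base URL and placeholder API key are assumptions, so adjust them to your deployment.

```python
from openai import OpenAI

# Assumption: NIM is serving its OpenAI-compatible API on localhost:8000.
# A self-hosted NIM typically does not validate the key, so a placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="mistralai/mistral-7b-instruct-v0.3",
    messages=[{"role": "user", "content": "who is the president of Korea?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental token delta.
    print(chunk.choices[0].delta.content or "", end="")
```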
- Install the Python dependencies:

  ```
  $ pip3 install -r requirements.txt
  ```
- Launch the FastAPI server:

  ```
  $ python3 launch.py
  ```
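For reference, a streaming endpoint like the one behind `launch.py` can be built on FastAPI's `StreamingResponse` over an async generator. The sketch below mirrors the `{"status": ..., "data": ...}` chunk format from the sample output in the next step; the `/generate` route and the placeholder token generator are assumptions, not the actual implementation.

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream(query: str):
    # Placeholder generator (assumption): a real implementation would
    # forward chunks from the selected LLM platform's streaming API.
    for token in ["Hello", ",", " world", "."]:
        yield json.dumps({"status": "processing", "data": token}) + "\n"
    yield json.dumps({"status": "complete", "data": "Stream finished"}) + "\n"

@app.get("/generate")
async def generate(q: str):
    # Stream newline-delimited JSON chunks as they are produced.
    return StreamingResponse(token_stream(q), media_type="application/x-ndjson")
```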
- Inference.
  - The inference works as an asynchronous stream:

    ```
    $ python3 client.py -p "nim" -q "who is the president of Korea?"
    ```

  - Sample output:

    ```
    [PLATFORM]: NIM
    [QUERY]: who is the president of Korea?
    [STATUS CODE]: 200
    [STREAMING RESPONSES]
    {"status": "processing", "data": " Moon"}
    {"status": "processing", "data": " J"}
    {"status": "processing", "data": "ae"}
    {"status": "processing", "data": "-"}
    {"status": "processing", "data": "in"}
    {"status": "processing", "data": " is"}
    {"status": "processing", "data": " the"}
    {"status": "processing", "data": " current"}
    {"status": "processing", "data": " president"}
    {"status": "processing", "data": " of"}
    {"status": "processing", "data": " South"}
    {"status": "processing", "data": " Korea"}
    {"status": "processing", "data": "."}
    {"status": "complete", "data": "Stream finished"}
    [FULL RESPONSE]
    Moon Jae-in is the current president of South Korea.
    ```
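For illustration, a client could consume that stream asynchronously with `httpx`, as sketched below. The `/generate` route, host, and port carry over from the hypothetical server sketch above; see `client.py` for the actual implementation.

```python
import asyncio
import json

import httpx

async def stream_query(query: str) -> None:
    # Assumption: the server from the sketch above is listening here.
    url = "http://localhost:8000/generate"
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", url, params={"q": query}) as response:
            print(f"[STATUS CODE]: {response.status_code}")
            # Print each newline-delimited JSON chunk as it arrives.
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                print(chunk)
                if chunk.get("status") == "complete":
                    break

asyncio.run(stream_query("who is the president of Korea?"))
```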