# Generative AI Use Cases {#ovms_docs_clients_genai}

```{toctree}
---
maxdepth: 1
hidden:
---

Chat completion API <ovms_docs_rest_api_chat>
Completions API <ovms_docs_rest_api_completion>
Embeddings API <ovms_docs_rest_api_embeddings>
Reranking API <ovms_docs_rest_api_rerank>
```

## Introduction

Besides the TensorFlow Serving API (`/v1`) and KServe API (`/v2`) frontends, the model server supports a range of endpoints for generative use cases (`/v3`). They are extensible using MediaPipe graphs. Currently supported endpoints are:

OpenAI-compatible endpoints:

- `chat/completions`
- `completions`
- `embeddings`

Cohere-compatible endpoint:

- `rerank`

## OpenAI API Clients

When creating a Python-based client application, you can use the OpenAI client library, `openai`.

Alternatively, it is possible to use just a `curl` command or the `requests` Python library.

### Install the Package

```bash
pip3 install openai
pip3 install requests
pip3 install cohere
```

### Request chat completions with unary calls

::::{tab-set}

:::{tab-item} python [OpenAI]
:sync: python-openai

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=False,
)

print(response.choices[0].message)
```

:::

:::{tab-item} python [requests]
:sync: python-requests

```python
import requests

payload = {"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Say this is a test"}]}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/chat/completions", json=payload, headers=headers)
print(response.text)
```

:::

:::{tab-item} curl
:sync: curl

```bash
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Say this is a test"}]}'
```

:::
::::

Check the LLM quick start and the end-to-end demo of text generation.
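When using `requests` or `curl`, the response arrives as a raw JSON string rather than a parsed object. A minimal sketch of extracting the assistant message from the response body; the sample payload below is illustrative, not an actual server response:

```python
import json

# Illustrative response body in the OpenAI chat-completions format;
# a real body comes from response.text (requests) or curl output.
body = '''{
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "This is a test."},
     "finish_reason": "stop"}
  ]
}'''

data = json.loads(body)
# The generated text lives under choices[0].message.content.
content = data["choices"][0]["message"]["content"]
print(content)
```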

### Request completions with unary calls

::::{tab-set}

:::{tab-item} python [OpenAI]
:sync: python-openai

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
response = client.completions.create(
    model="meta-llama/Llama-2-7b",
    prompt="Say this is a test",
    stream=False,
)

print(response.choices[0].text)
```

:::

:::{tab-item} python [requests]
:sync: python-requests

```python
import requests

payload = {"model": "meta-llama/Llama-2-7b", "prompt": "Say this is a test"}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/completions", json=payload, headers=headers)
print(response.text)
```

:::

:::{tab-item} curl
:sync: curl

```bash
curl http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b", "prompt": "Say this is a test"}'
```

:::
::::

Check the LLM quick start and the end-to-end demo of text generation.

### Request chat completions with streaming

::::{tab-set}

:::{tab-item} python [OpenAI]
:sync: python-openai

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

:::
::::

Check the LLM quick start and the end-to-end demo of text generation.
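The streamed output can also be consumed without the OpenAI client. Chat-completion streaming uses server-sent events, where each event line has the form `data: <json chunk>` and the stream ends with `data: [DONE]`. A sketch of a parser for such lines, usable with `requests.post(..., stream=True)` and `response.iter_lines()`; the helper name `iter_sse_content` and the sample lines are ours, not part of any library:

```python
import json

def iter_sse_content(lines):
    """Yield delta content strings from server-sent-event lines of the
    form 'data: {json}', stopping at the terminating 'data: [DONE]'."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue  # skip empty keep-alive lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content") is not None:
            yield delta["content"]

# Illustrative SSE lines; with a live server you would instead pass
# requests.post(url, json={..., "stream": True}, stream=True).iter_lines()
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    b'data: {"choices": [{"delta": {"content": " world"}}]}',
    b"data: [DONE]",
]
print("".join(iter_sse_content(sample)))  # prints "Hello world"
```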

### Request completions with streaming

::::{tab-set}

:::{tab-item} python [OpenAI]
:sync: python-openai

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)

stream = client.completions.create(
    model="meta-llama/Llama-2-7b",
    prompt="Say this is a test",
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].text is not None:
        print(chunk.choices[0].text, end="")
```

:::
::::

Check the LLM quick start and the end-to-end demo of text generation.

### Text embeddings

::::{tab-set}

:::{tab-item} python [OpenAI]
:sync: python-openai

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",
    api_key="unused"
)
responses = client.embeddings.create(input=["hello world"], model="Alibaba-NLP/gte-large-en-v1.5")
for data in responses.data:
    print(data.embedding)
```

:::

:::{tab-item} python [requests]
:sync: python-requests

```python
import requests

payload = {"model": "Alibaba-NLP/gte-large-en-v1.5", "input": "hello world"}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/embeddings", json=payload, headers=headers)
print(response.text)
```

:::

:::{tab-item} curl
:sync: curl

```bash
curl http://localhost:8000/v3/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "Alibaba-NLP/gte-large-en-v1.5", "input": "hello world"}'
```

:::
::::

Check the text embeddings end-to-end demo.
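Embedding vectors are typically compared with cosine similarity. A small self-contained sketch, assuming you have already retrieved embedding vectors from the endpoint; the short 3-dimensional vectors below are made up for illustration (real embeddings from this model are much longer):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings
# (real ones come from data.embedding in the responses above).
emb_a = [0.1, 0.2, 0.3]
emb_b = [0.1, 0.2, 0.3]
emb_c = [0.3, -0.2, 0.1]

print(cosine_similarity(emb_a, emb_b))  # identical vectors -> 1.0
print(cosine_similarity(emb_a, emb_c))  # dissimilar vectors -> lower score
```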

## Cohere Python Client

Clients can use the rerank endpoint via the Cohere Python package, `cohere`.

Just like with the OpenAI endpoints, an alternative is a `curl` command or the `requests` Python library.

### Install the Package

```bash
pip3 install cohere
pip3 install requests
```

### Documents reranking

::::{tab-set}

:::{tab-item} python [Cohere]
:sync: python-cohere

```python
import cohere

client = cohere.Client(base_url="http://localhost:8000/v3", api_key="not_used")
responses = client.rerank(query="Hello", documents=["Welcome", "Farewell"], model="BAAI/bge-reranker-large")
for res in responses.results:
    print(res.index, res.relevance_score)
```

:::

:::{tab-item} python [requests]
:sync: python-requests

```python
import requests

payload = {"model": "BAAI/bge-reranker-large", "query": "Hello", "documents": ["Welcome", "Farewell"]}
headers = {"Content-Type": "application/json", "Authorization": "not used"}
response = requests.post("http://localhost:8000/v3/rerank", json=payload, headers=headers)
print(response.text)
```

:::

:::{tab-item} curl
:sync: curl

```bash
curl http://localhost:8000/v3/rerank \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-reranker-large", "query": "Hello", "documents": ["Welcome", "Farewell"]}'
```

:::
::::

Check the documents reranking end-to-end demo.
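The rerank response returns (index, relevance_score) pairs rather than reordered documents, so clients usually sort their own document list by score. A sketch of that step, using a made-up response in the shape shown by the `requests` output above:

```python
# Made-up rerank response; a real one comes from response.json()
# of the /v3/rerank call above. The scores here are illustrative.
documents = ["Welcome", "Farewell"]
response = {
    "results": [
        {"index": 0, "relevance_score": 0.91},
        {"index": 1, "relevance_score": 0.07},
    ]
}

# Sort result entries by descending relevance, then map the indices
# back onto the original document list.
ranked = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
ordered_docs = [documents[r["index"]] for r in ranked]
print(ordered_docs)  # most relevant document first
```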