Open Data Hub and Red Hat OpenShift AI both use KServe as the infrastructure for serving models.
By default, KServe is configured to pull models from an S3-compatible store (e.g. AWS S3, IBM COS, MinIO, Red Hat Multicloud Object Gateway, etc.).
When KServe initializes an `InferenceService` runtime's pod, an init container downloads the model into an `emptyDir` volume mount.
This enables the inference server to exploit the performance of locally attached storage if the server crashes and needs to restart.
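For context, the predictor pod that KServe generates in this default setup roughly follows the pattern below. This is a simplified, illustrative sketch rather than a manifest to apply; the `storage-initializer` init container, the `kserve-provision-location` `emptyDir` volume and the `/mnt/models` mount path follow KServe conventions, and the S3 source is a placeholder.

```yaml
# Simplified sketch of the default KServe predictor pod (generated by KServe, not applied by hand)
apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: storage-initializer          # downloads the model before the server starts
      args:
        - s3://<bucket>/<model-path>     # source, taken from the data connection / storage spec
        - /mnt/models                    # destination inside the shared volume
      volumeMounts:
        - name: kserve-provision-location
          mountPath: /mnt/models
  containers:
    - name: kserve-container             # the inference server, e.g. vLLM
      volumeMounts:
        - name: kserve-provision-location
          mountPath: /mnt/models
          readOnly: true
  volumes:
    - name: kserve-provision-location
      emptyDir: {}                       # model cached on the node's local disk
```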
However, some environments are constrained:
- There is insufficient local storage to cache the model on local disk.
- Filling local storage on nodes can have adverse side effects, including thrashing as the node tries to evict enough workload to recover the minimum required free space.
- S3 is not universally available on-premises.
- While services such as MinIO or the NooBaa Multicloud Object Gateway can be deployed on the cluster, they ultimately consume PVCs themselves.
As an alternative, this repository documents the process for using a PVC to store a model, avoiding the need for S3 and local `emptyDir` volume mounts.
- At this stage the repository provides an example of doing so, not a customized pipeline.
- Minimal templating has been used.
- The example was built on a Red Hat OpenShift on AWS cluster. There are references to AWS which may not apply elsewhere (such as the gp3 storage classes).
- This has only been tested with a "Single Model Server" such as vLLM.
- A future improvement would be to convert the process into a templated workflow.
In order to serve a model from a PVC, the storage class must have a `RECLAIMPOLICY` of `Retain`.
In the case of this cluster, none of the storage classes had the appropriate reclaim policy (see below):
```
oc get sc
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2             kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   43h
gp2-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   42h
gp3 (default)   ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   43h
gp3-csi         ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   42h
```
However, the reclaim policy is controlled on the cluster side (it is not a feature fixed by the provisioner), so a suitable storage class can be created:
- Pull down the closest storage class to the one you want to use: `oc get sc gp3-csi -o yaml > ./sampleManifests/original-gp3-storage-class.yaml`
- Copy it to a clean file. Remove the server-generated metadata fields (such as `uid`) and change `reclaimPolicy: Delete` to `reclaimPolicy: Retain` (a sketch of the result follows this list).
- Apply the new manifest: `oc apply -f ./sampleManifests/gp3-sc-with-retain.yaml`
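For reference, a sketch of what the edited storage class manifest might look like, assuming the gp3/EBS example from this cluster (the name `gp3-csi-retain` matches the output below; adjust the provisioner and parameters for your environment):

```yaml
# Illustrative sketch of sampleManifests/gp3-sc-with-retain.yaml; the file in the repository is authoritative
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-csi-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain                    # changed from Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```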
The result should be an additional storage class:
```
oc get sc
NAME             PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2              kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   true                   43h
gp2-csi          ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   43h
gp3 (default)    ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   43h
gp3-csi          ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   43h
gp3-csi-retain   ebs.csi.aws.com         Retain          WaitForFirstConsumer   true                   159m
```
This section is assumed to be run from your notebook server on the cluster:

- `oc login` to the cluster in your notebook, e.g. `!oc login --token ..`
- Create and apply a PVC (named `vllm-model-cache`): `oc apply -f ./sampleManifest/model-pvc.yaml`
- Create a pod which does nothing but contains `tar` to enable the copy: `oc apply -f ./sampleManifest/model-storage-pod.yaml` (both manifests are sketched after this list)
- Use `oc cp` to copy the model up to the PVC: `oc cp granite-7b-lab model-store-pod-ubi:/pv/granite-7b-lab -c model-store`
  - In the case above, `granite-7b-lab` is the relative path to the model in the notebook
- Delete the pod but not the PVC: `oc delete -f ./sampleManifest/model-storage-pod.yaml`

The model is now in a PVC, ready to serve!
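For reference, a minimal sketch of what `model-pvc.yaml` and `model-storage-pod.yaml` might contain. The PVC name `vllm-model-cache`, pod name `model-store-pod-ubi`, container name `model-store` and `/pv` mount path mirror the commands above; the image, storage size and storage class are assumptions to adjust for your cluster.

```yaml
# Illustrative sketches only; the files under ./sampleManifest are authoritative
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache               # referenced later as pvc://vllm-model-cache/...
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3-csi-retain     # the Retain storage class created above
  resources:
    requests:
      storage: 50Gi                    # assumed size; make it large enough for the model
---
apiVersion: v1
kind: Pod
metadata:
  name: model-store-pod-ubi
spec:
  containers:
    - name: model-store                # matches the -c model-store flag used with oc cp
      image: registry.access.redhat.com/ubi9/ubi   # assumed image; any image with tar works
      command: ["sleep", "infinity"]   # does nothing, just stays up for the copy
      volumeMounts:
        - name: model-volume
          mountPath: /pv               # target path used in the oc cp command
  volumes:
    - name: model-volume
      persistentVolumeClaim:
        claimName: vllm-model-cache
```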
When deploying a single model server in ODH / Red Hat OpenShift AI, two objects are created:

- A `ServingRuntime`
- An `InferenceService`
The easiest way to get valid objects is to attempt to deploy a model using the UI first.
This example deployed a vLLM runtime called `granite` to the `llm` namespace.
With this in place, the objects are retrievable:
```
oc get servingruntimes.serving.kserve.io -n llm granite -o yaml > ./sampleManifests/granite-servingRuntime.yaml
oc get inferenceservices.serving.kserve.io -n llm granite -o yaml > ./sampleManifests/granite-instanceServices.yaml
```
- The `ServingRuntime` only needs its name references changed, e.g. in this case find `granite` and replace it with `granite-pvc` (the new file is here).
- The `InferenceService` needs:
  - A name change from `granite` to `granite-pvc`. The `runtime:` element MUST match the name of the `ServingRuntime`.
  - The storage reference changed. Note this path is hard-coded with respect to what you uploaded and the PVC name.
Original storage:

```yaml
...
runtime: granite
storage:
  key: aws-connection-minio
  path: granite-7b-lab
...
```
New storage:

```yaml
...
runtime: granite-pvc
storageUri: pvc://vllm-model-cache/granite-7b-lab
...
```
The full manifest is here.
The PVC specified must be in the same namespace as the `InferenceService`.
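For orientation, a sketch of where the new storage reference sits inside the `InferenceService` spec. The names mirror this example, the model format is assumed to be the vLLM format produced by the UI-generated manifest, and resource requests and other generated fields are omitted:

```yaml
# Illustrative sketch; new-granite-pvc-instanceServices.yaml in the repository is authoritative
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: granite-pvc
  namespace: llm
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                                         # assumed; keep whatever the UI generated
      runtime: granite-pvc                                 # MUST match the ServingRuntime name
      storageUri: pvc://vllm-model-cache/granite-7b-lab    # <pvc name>/<path copied with oc cp>
```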
Once all of this is complete you can apply the manifests to your cluster:
```
oc apply -f ./sampleManifests/new-granite-pvc-servingRuntime.yaml
oc apply -f ./sampleManifests/new-granite-pvc-instanceServices.yaml
```
Logging into the node and using `du` to inspect `/var/lib/kubelet/pods` should show that the pods do not use local `emptyDir` storage.