diff --git a/.gitignore b/.gitignore index b88d814..3d88bce 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,3 @@ -docs/_build/ \ No newline at end of file +# Sphinx documentation +_build/ +.venv/ \ No newline at end of file diff --git a/docs/installation/openshift-helm.md b/docs/installation/openshift-helm.md new file mode 100644 index 0000000..7c745c6 --- /dev/null +++ b/docs/installation/openshift-helm.md @@ -0,0 +1,262 @@ +# OpenShift (Helm) + +```{warning} +Installing via Helm is not a recommended method for Red Hat OpenShift. Users wishing to use the AMD GPU with OpenShift should consider using the OLM method instead. +``` + +This guide walks through installing the AMD GPU Operator on an OpenShift cluster using Helm. + +## Prerequisites + +### OpenShift Requirements + +- OpenShift Container Platform 4.16 or later +- Cluster administrator privileges +- Helm v3.2.0 or later +- `oc` CLI tool configured with cluster access + +### Required OpenShift Operators + +The following operators must be enabled in your OpenShift cluster (enabled by default): + +- **Service-CA Operator** + - Required for certificate signing and webhook authentication + - Verifies communication between kube-api-server and KMM webhook server + +- **MachineConfig Operator** + - Required for configuring the blacklist for `amdgpu` driver + - Manages node-level configuration + +- **Cluster Image Registry Operator** + - Required for driver image builds within OpenShift + - Manages internal image registry storage + - Steps to enable image registry operator if it is disabled (example using emptyDir): + - Configure registry storage: ```oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'``` + - Enable the registry: ```oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'``` + - Verify the registry pod is running: ```oc get pods -n openshift-image-registry``` + +## 
Installation Methods + +There are two ways to install the AMD GPU Operator on OpenShift: + +1. [All-in-One Installation](#method-1-all-in-one-installation) +2. [Component-by-Component Installation](#method-2-component-by-component-installation) + +### Method 1: All-in-One Installation + +This method installs the operator and all dependencies using a single Helm chart. + +- Install the operator and dependencies: + +```bash +helm install amd-gpu-operator rocm/gpu-operator-helm \ + --namespace kube-amd-gpu \ + --create-namespace \ + --set platform=openshift +``` + +- Verify the installation: + +```bash +oc get pods -n kube-amd-gpu +``` + +Expected output: + +```bash +NAME READY STATUS RESTARTS AGE +nfd-master-67b568b89c-lvk9k 1/1 Running 0 2m +nfd-worker-nkrgl 1/1 Running 0 2m +amd-gpu-operator-controller-manager-56844b49b4-tk75f 1/1 Running 0 2m +amd-gpu-kmm-controller-78ddd75846-kxd8n 1/1 Running 0 2m +amd-gpu-kmm-webhook-server-749cb8b565-ktbsp 1/1 Running 0 2m +amd-gpu-nfd-controller-manager-77764d98c5-h76pp 2/2 Running 0 2m +``` + +### Method 2: Component-by-Component Installation + +This method allows more control over the installation process by installing dependencies separately. + +#### Step 1: Install Node Feature Discovery (NFD) Operator + +1. Navigate to OpenShift Web Console → OperatorHub +2. Search for "Node Feature Discovery" +3. Select and install the Red Hat version of the operator +4. Choose the default installation options + +#### Step 2: Install Kernel Module Management (KMM) Operator + +1. Navigate to OpenShift Web Console → OperatorHub +2. Search for "Kernel Module Management" +3. Select and install the Red Hat version (without Hub label) +4. 
Choose the default installation options + +#### Step 3: Install AMD GPU Operator + +Install the operator while skipping the already-installed dependencies: + +```bash +helm install amd-gpu-operator rocm/gpu-operator-helm \ + --namespace kube-amd-gpu \ + --create-namespace \ + --set platform=openshift \ + --set nfd.enabled=false \ + --set kmm.enabled=false +``` + +## Post-Installation Configuration + +### 1. Configure Node Feature Discovery + +Create an NFD rule to detect AMD GPUs: + +```yaml +apiVersion: nfd.openshift.io/v1 +kind: NodeFeatureDiscovery +metadata: + name: amd-gpu-nfd-instance + namespace: kube-amd-gpu +spec: + operand: + image: quay.io/openshift/origin-node-feature-discovery:4.16 + imagePullPolicy: IfNotPresent + servicePort: 12000 + workerConfig: + configData: | + core: + sleepInterval: 60s + sources: + pci: + deviceClassWhitelist: + - "0200" + - "03" + - "12" + deviceLabelFields: + - "vendor" + - "device" + custom: + - name: amd-gpu + labels: + feature.node.kubernetes.io/amd-gpu: "true" + matchAny: + - matchFeatures: + - feature: pci.device + matchExpressions: + vendor: {op: In, value: ["1002"]} + device: {op: In, value: [ + "74a0", # MI300A + "74a1", # MI300X + "740f", # MI210 + "7408", # MI250X + "740c", # MI250/MI250X + "738c", # MI100 + "738e" # MI100 + ]} +``` + +### 2. Create blacklist (for installing out-of-tree kernel module) + +Create a Machine Config Operator custom resource that adds the `amdgpu` kernel module to the modprobe blacklist. The following `MachineConfig` custom resource is an example. Set the label `machineconfiguration.openshift.io/role` to `master` if you run Single Node OpenShift, or to `worker` in other scenarios with dedicated controllers. 
+ +```{warning} +After you add the `amdgpu` kernel module to the blacklist using a `MachineConfig` custom resource, **the Machine Config Operator will automatically reboot the selected nodes.** +``` + +```yaml +apiVersion: machineconfiguration.openshift.io/v1 +kind: MachineConfig +metadata: + labels: + machineconfiguration.openshift.io/role: worker + name: amdgpu-module-blacklist +spec: + config: + ignition: + version: 3.2.0 + storage: + files: + - path: "/etc/modprobe.d/amdgpu-blacklist.conf" + mode: 420 + overwrite: true + contents: + source: "data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo=" +``` + +### 3. Create DeviceConfig Resource + +Create a `DeviceConfig` to trigger driver installation: + +```yaml +apiVersion: amd.com/v1alpha1 +kind: DeviceConfig +metadata: + name: amd-gpu-config + namespace: kube-amd-gpu +spec: + driver: + enable: true + image: image-registry.openshift-image-registry.svc:5000/amdgpu_kmod + version: 6.2.2 + selector: + feature.node.kubernetes.io/amd-gpu: "true" +``` + +## Verification + +### 1. Check Node Labels + +Verify GPU detection: + +```bash +oc get nodes -l feature.node.kubernetes.io/amd-gpu=true +``` + +### 2. Check Component Status + +- Verify all pods are running: + +```bash +oc get pods -n kube-amd-gpu +``` + +- Check GPU resource availability: + +```bash +oc get node -o json | jq '.items[].status.capacity."amd.com/gpu"' +``` + +### 3. Check Driver Status + +Monitor driver installation: + +```bash +oc logs -n kube-amd-gpu -l app=kmm-worker +``` + +## Troubleshooting + +### Common Issues + +1. **Certificate Issues** + - Check Service-CA operator status + - Verify webhook certificates are properly mounted + +2. **Driver Build Failures** + - Check builder pod logs + - Verify registry access + - Check available storage + +3. 
**Node Labeling Issues** + - Verify NFD operator status + - Check NFD worker pods on GPU nodes + - Review NFD rule syntax + +For detailed troubleshooting, run the support tool: + +```bash +./tools/techsupport_dump.sh -w -o yaml +``` + +## Uninstallation + +Please refer to the [Uninstallation](../uninstallation/uninstallation) document for uninstalling related resources. diff --git a/docs/installation/openshift-olm.md b/docs/installation/openshift-olm.md new file mode 100644 index 0000000..ad3bc07 --- /dev/null +++ b/docs/installation/openshift-olm.md @@ -0,0 +1,245 @@ +# OpenShift (OLM) + +This guide explains how to deploy the AMD GPU Operator on OpenShift using the Operator Lifecycle Manager (OLM). + +## Prerequisites + +Before installing the AMD GPU Operator, ensure your OpenShift cluster meets the following requirements: + +### Required Operators + +The following operators must be enabled in your OpenShift cluster (these are typically enabled by default): + +#### Service CA Operator + +- Required for certificate signing and authentication between the kube-apiserver and KMM webhook server +- Verify status: + +```bash +oc get pods -A | grep service-ca +``` + +#### Operator Lifecycle Manager (OLM) + +- Required for managing operator installation and dependencies +- Verify status: + +```bash +oc get pods -A | grep operator-lifecycle +``` + +#### MachineConfig Operator + +- Required for configuring the blacklist for `amdgpu` driver +- Verify status: + +```bash +oc get pods -A | grep machine-config +``` + +#### Cluster Image Registry Operator + +- Required for driver image building and storage within OpenShift cluster +- Verify status: + +```bash +oc get pods -A | grep image-registry +``` + +### Configure Internal Registry + +If you plan to build driver images within the cluster, you must enable the OpenShift internal registry: + +- Verify current registry status: + +```bash +oc get pods -n openshift-image-registry +``` + +- Configure registry storage (example using 
emptyDir): + +```bash +oc patch configs.imageregistry.operator.openshift.io cluster --type merge \ + --patch '{"spec":{"storage":{"emptyDir":{}}}}' +``` + +- Enable the registry: + +```bash +oc patch configs.imageregistry.operator.openshift.io cluster --type merge \ + --patch '{"spec":{"managementState":"Managed"}}' +``` + +- Verify the registry pod is running: + +```bash +oc get pods -n openshift-image-registry +``` + +## Installation + +### 1. Install Required Dependencies + +- Install Node Feature Discovery (NFD) Operator + +1. Navigate to the OpenShift Web Console +2. Go to OperatorHub +3. Search for "Node Feature Discovery" +4. Select and install the Red Hat version of the operator + +- Install Kernel Module Management (KMM) Operator + +1. Navigate to the OpenShift Web Console +2. Go to OperatorHub +3. Search for "Kernel Module Management" +4. Select and install the Red Hat version (without Hub label) + +### 2. Install AMD GPU Operator + +Currently, the AMD GPU Operator is not available in OperatorHub. Install it using the Operator SDK: + +1. Set up your environment: + - Install the `kubectl` binary + - Configure access to your OpenShift cluster + - Install [Operator SDK](https://sdk.operatorframework.io/docs/installation/) + +2. Deploy the operator bundle: + +```bash +operator-sdk run bundle docker.io/amd/gpu-operator-bundle:v0.0.1 --namespace=default +``` + +> **Note**: The bundle image URL and tag will be updated in future releases. + +3. Verify the operator deployment: + +```bash +oc get pods +``` + +## Configuration + +### 1. 
Create Node Feature Discovery Rule + +Create an NFD rule to detect AMD GPU hardware: + +```yaml +apiVersion: nfd.openshift.io/v1 +kind: NodeFeatureDiscovery +metadata: + name: amd-gpu-operator-nfd-instance + namespace: default +spec: + operand: + image: quay.io/openshift/origin-node-feature-discovery:4.16 + imagePullPolicy: IfNotPresent + servicePort: 12000 + workerConfig: + configData: | + core: + sleepInterval: 60s + sources: + pci: + deviceClassWhitelist: + - "0200" + - "03" + - "12" + deviceLabelFields: + - "vendor" + - "device" + custom: + - name: amd-gpu + labels: + feature.node.kubernetes.io/amd-gpu: "true" + matchAny: + - matchFeatures: + - feature: pci.device + matchExpressions: + vendor: {op: In, value: ["1002"]} + device: {op: In, value: [ + "74a0", # MI300A + "74a1", # MI300X + "740f", # MI210 + "7408", # MI250X + "740c", # MI250/MI250X + "738c", # MI100 + "738e" # MI100 + ]} +``` + +Verify the NFD label is applied: + +```bash +oc get node -o yaml | grep "amd-gpu" +``` + +### 2. Create blacklist (for installing out-of-tree kernel module) + +Create a Machine Config Operator custom resource that adds the `amdgpu` kernel module to the modprobe blacklist. The following `MachineConfig` custom resource is an example. Set the label `machineconfiguration.openshift.io/role` to `master` if you run Single Node OpenShift, or to `worker` in other scenarios with dedicated controllers. 
+ +```{warning} +After you add the `amdgpu` kernel module to the blacklist using a `MachineConfig` custom resource, **the Machine Config Operator will automatically reboot the selected nodes.** +``` + +```yaml +apiVersion: machineconfiguration.openshift.io/v1 +kind: MachineConfig +metadata: + labels: + machineconfiguration.openshift.io/role: worker + name: amdgpu-module-blacklist +spec: + config: + ignition: + version: 3.2.0 + storage: + files: + - path: "/etc/modprobe.d/amdgpu-blacklist.conf" + mode: 420 + overwrite: true + contents: + source: "data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo=" +``` + +### 3. Create DeviceConfig + +Create a DeviceConfig CR to trigger the GPU driver installation: + +```yaml +apiVersion: amd.com/v1alpha1 +kind: DeviceConfig +metadata: + name: test-cr + namespace: default +spec: + driver: + enable: true + image: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod + version: 6.2.2 + selector: + "feature.node.kubernetes.io/amd-gpu": "true" +``` + +The operator will: + +1. Collect worker node system specifications +2. Build or retrieve the appropriate driver image +3. Deploy the driver using KMM +4. Deploy the ROCm device plugin and node labeller + +Verify the deployment: + +```bash +# Check KMM worker status +oc get pods | grep kmm-worker + +# Check device plugin and labeller status +oc get pods | grep test-cr + +# Verify GPU resource labels +oc get node -o json | grep amd.com +``` + +## Uninstallation + +Please refer to the [Uninstallation](../uninstallation/uninstallation) document for uninstalling related resources. diff --git a/docs/releasenotes.md b/docs/releasenotes.md index dc6c22d..8b6efdd 100644 --- a/docs/releasenotes.md +++ b/docs/releasenotes.md @@ -2,7 +2,9 @@ The GPU Operator v1.1.0 release adds support for Red Hat OpenShift versions 4.16 and 4.17. The AMD GPU Operator has gone through a rigourous validation process and is now *certified* for use on OpenShift. 
It can now be deployed via [the Red Hat Catalog](https://catalog.redhat.com/software/container-stacks/detail/6722781e65e61b6d4caccef8). -> **Note:** The latest AMD GPU Operator OLM Bundle for OpenShift is tagged with version v1.1.1 as the operator image has been updated to include a minor driver fix. +```{note} +The latest AMD GPU Operator OLM Bundle for OpenShift is tagged with version v1.1.1 as the operator image has been updated to include a minor driver fix. +``` ## Release Highlights diff --git a/docs/sphinx/_toc.yml b/docs/sphinx/_toc.yml index 825614f..e543678 100644 --- a/docs/sphinx/_toc.yml +++ b/docs/sphinx/_toc.yml @@ -5,13 +5,13 @@ defaults: subtrees: - caption: Usage entries: - - file: overview - - file: releasenotes - title: Release Notes - - file: usage - - file: troubleshooting - - file: knownlimitations - title: Known Limitations + - file: overview + - file: releasenotes + title: Release Notes + - file: usage + - file: troubleshooting + - file: knownlimitations + title: Known Limitations - caption: Installation entries: - file: installation/kubernetes-helm diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 825614f..e543678 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -5,13 +5,13 @@ defaults: subtrees: - caption: Usage entries: - - file: overview - - file: releasenotes - title: Release Notes - - file: usage - - file: troubleshooting - - file: knownlimitations - title: Known Limitations + - file: overview + - file: releasenotes + title: Release Notes + - file: usage + - file: troubleshooting + - file: knownlimitations + title: Known Limitations - caption: Installation entries: - file: installation/kubernetes-helm