AMD GPU Operator v1.0.0 Release Notes
This is the first major release of the AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct™ GPU accelerators within Kubernetes clusters. The project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, generative AI, and other GPU-intensive applications.
Release Highlights
- Manage AMD GPU drivers with desired versions on Kubernetes cluster nodes
- Customized scheduling of AMD GPU workloads within a Kubernetes cluster
- Metrics and statistics monitoring solution for AMD GPU hardware and workloads
- Support for specialized networking environments such as HTTP proxy or air-gapped networks
Hardware Support
New Hardware Support
- AMD Instinct™ MI300
  - Required driver version: ROCm 6.2+
- AMD Instinct™ MI250
  - Required driver version: ROCm 6.2+
- AMD Instinct™ MI210
  - Required driver version: ROCm 6.2+
Platform Support
New Platform Support
- Kubernetes 1.29+
  - Supported features:
    - Driver management
    - Workload scheduling
    - Metrics monitoring
  - Requirements: Kubernetes version 1.29+
Breaking Changes
Not Applicable as this is the initial release.
New Features
Feature Category
- Driver Management
  - Managed Driver Installations: Users can install the ROCm 6.2+ DKMS driver on Kubernetes worker nodes, or optionally choose to use the inbox or pre-installed driver on the worker nodes
  - DeviceConfig Custom Resource: Users can configure a new DeviceConfig CRD (Custom Resource Definition) to define the driver management behavior of the GPU Operator
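As a sketch, a minimal DeviceConfig might look like the following. The API version, the fields under `spec.driver`, and the node selector label are illustrative assumptions; only the `spec/metricsExporter/enable` path is confirmed by the patch command shown under Known Limitations. Consult the installed CRD schema for the authoritative field list.

```yaml
apiVersion: amd.com/v1alpha1        # assumed API group/version
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true                    # let the Operator manage the DKMS driver (assumed field)
    version: "6.2"                  # desired ROCm driver version (assumed field)
  metricsExporter:
    enable: true                    # matches the /spec/metricsExporter/enable patch path
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"   # target GPU nodes (assumed label)
```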
- GPU Workload Scheduling
  - Custom Resource Allocation "amd.com/gpu": After the GPU Operator is deployed, a new custom resource, amd.com/gpu, will be present on each GPU node, listing the allocatable GPU resources against which GPU workloads can be scheduled
  - Assign Multiple GPUs: Users can easily specify the number of AMD GPUs required by each workload in the deployment/pod spec, and the Kubernetes scheduler will automatically take care of assigning the correct GPU resources
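Requesting GPUs follows the standard Kubernetes extended-resource pattern; a pod spec along these lines would ask the scheduler for two AMD GPUs (the pod name and container image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rocm-workload          # example name
spec:
  containers:
  - name: rocm-workload
    image: rocm/pytorch:latest # example image
    resources:
      limits:
        amd.com/gpu: 2         # request two AMD GPUs on the node
```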
- Metrics Monitoring for GPUs and Workloads
  - Out-of-box Metrics: Users can optionally enable the AMD Device Metrics Exporter when installing the AMD GPU Operator to get a robust out-of-box monitoring solution for Prometheus to consume
  - Custom Metrics Configurations: Users can utilize a ConfigMap to customize the configuration and behavior of the Device Metrics Exporter
- Specialized Network Setups
  - Air-gapped Installation: Users can install the GPU Operator in a secure air-gapped environment where the Kubernetes cluster has no external network connectivity
  - HTTP Proxy Support: The AMD GPU Operator supports usage within a Kubernetes cluster that is behind an HTTP proxy. HTTPS support will be added in a future release.
Known Limitations
- GPU Operator driver installs only the DKMS package
  - Impact: Applications which require ROCm packages will need to install the respective packages.
  - Affected Configurations: All configurations
  - Workaround: None; this is the intended behavior, as other ROCm software packages should be managed inside the containers/workloads running on the cluster
- When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install
  - Impact: The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg
  - Affected Configurations: Nodes with driver version >= ROCm 6.2.x
  - Workaround: Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+
- GPU Operator unable to install the amdgpu driver if an existing driver is already installed
  - Impact: The driver install will fail if the amdgpu in-box driver is present/already installed
  - Affected Configurations: All configurations
  - Workaround: When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted, or amdgpu drivers should not be pre-installed on the node. Blacklist the in-box driver so that it is not loaded, or remove the pre-installed driver
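One way to apply the blacklist workaround is the standard modprobe mechanism, run on each worker node before installing via the Operator (the file name is conventional; the initramfs rebuild command shown is for Debian/Ubuntu and differs on other distributions):

```shell
# Prevent the in-box amdgpu module from loading at boot
echo "blacklist amdgpu" | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf
# Rebuild the initramfs so the blacklist takes effect on the next boot (Debian/Ubuntu)
sudo update-initramfs -u
sudo reboot
```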
- When the GPU Operator is used in skip-driver-install mode, if the amdgpu module is removed while the device plugin is installed, the device plugin will not reflect the active GPUs available on the server
  - Impact: Workload scheduling is affected, as workloads may be scheduled on nodes which do not have an active GPU.
  - Affected Configurations: All configurations
  - Workaround: Restart the deployed device plugin pod.
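Restarting can be done by deleting the device plugin pod so its controller recreates it; the label selector below is an illustrative assumption, so check the actual labels on the pods in your deployment first:

```shell
# Inspect the pods to find the device plugin's actual labels
kubectl -n kube-amd-gpu get pods --show-labels
# Delete the pod; its DaemonSet/Deployment recreates it (label is an assumed example)
kubectl -n kube-amd-gpu delete pod -l app=device-plugin
```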
- Worker nodes where the kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed
  - Impact: The node upgrade will not proceed automatically and requires manual intervention
  - Affected Configurations: All configurations
  - Workaround: Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:
    kubectl cordon <node-name>
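The full manual sequence typically also drains the node before the kernel upgrade and re-enables scheduling afterwards; this is the standard kubectl node-maintenance workflow, not specific to the GPU Operator:

```shell
kubectl cordon <node-name>      # stop new pods from being scheduled on the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data   # evict running workloads
# ... upgrade the kernel and reboot the node ...
kubectl uncordon <node-name>    # allow scheduling on the node again
```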
- When the GPU Operator is installed with the Metrics Exporter enabled, driver upgrade is blocked because the exporter is actively using the amdgpu module
  - Impact: Driver upgrade is blocked
  - Affected Configurations: All configurations
  - Workaround: Disable the Metrics Exporter to allow the driver upgrade by updating the DeviceConfig named gpu-operator in the kube-amd-gpu namespace, setting the metrics exporter's enable field to false:
    kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='json' -p='[{"op": "replace", "path": "/spec/metricsExporter/enable", "value": false}]'