[DOC] Update Full DeviceConfig Example for release v1.1.0 (#54)
* Update Full DeviceConfig Example

* Update docs/fulldeviceconfig.rst

Co-authored-by: Matt Elliott <[email protected]>

* Update docs/fulldeviceconfig.rst

Co-authored-by: Matt Elliott <[email protected]>

* Update docs/fulldeviceconfig.rst

Co-authored-by: Matt Elliott <[email protected]>

* Update docs/fulldeviceconfig.rst

Co-authored-by: Matt Elliott <[email protected]>

* Apply more suggestions

---------

Co-authored-by: Matt Elliott <[email protected]>
yansun1996 and AMD-melliott authored Feb 4, 2025
1 parent c2bde6c commit b870c8d
Showing 1 changed file with 70 additions and 16 deletions.
86 changes: 70 additions & 16 deletions docs/fulldeviceconfig.rst
@@ -28,19 +28,26 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM
.. code-block:: yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig #New Custom Resource Definition used by the GPU Operator
kind: DeviceConfig # New Custom Resource Definition used by the GPU Operator
metadata:
# Name of the DeviceConfig CR. Note that the name of device plugin, node-labeller and metric-explorter pods will be prefixed with
name: gpu-operator
namespace: kube-amd-gpu # Namespace for the GPU Operator and it's components
# Name that will prefix device plugin, node-labeller and metrics-exporter pods
name: gpu-operator
# Namespace where the GPU Operator and its components will run
namespace: kube-amd-gpu
spec:
## AMD GPU Driver Configuration ##
driver:
# Set to false to skip driver installation to use inbox or pre-installed driver on worker nodes
# Set to True to enable operator to install out-of-tree amdgpu kernel module
# Set to false to use existing in-tree/pre-installed driver
# Set to true to install out-of-tree amdgpu kernel module
# Default: true
enable: false
blacklist: false # Set to true to blacklist the amdgpu kernel module which is required for installing out-of-tree driver
# Specify your repository to host driver image
# Set blacklist to true to blacklist the inbox / pre-installed amdgpu kernel module
# Required when spec.driver.enable is true
# GPU Worker node reboot is required to apply the blacklist
blacklist: false
# Specify the out-of-tree amdgpu driver version you want to install that coincides with a ROCm version number
version: "6.3"
# Specify your repository URL to host driver image for out-of-tree amdgpu kernel module
# DO NOT include the image tag as AMD GPU Operator will automatically manage the image tag for you
image: docker.io/username/repo
# (Optional) Specify the credential for your private registry if it requires credential to get pull/push access
@@ -49,23 +56,70 @@ Below is an example of a full DeviceConfig CR that can be used to install the AM
# Make sure you created the secret within the namespace that KMM operator is running
imageRegistrySecret:
name: mysecret
# (Optional) Specify your image registry's TLS config
imageRegistryTLS:
insecure: False # If True, check for the container image using plain HTTP
insecureSkipTLSVerify: False # If True, skip any TLS server certificate validation (useful for self-signed certificates)
version: "6.3" # Specify the driver version you would like to be installed that coincides with a ROCm version number
# (Optional) Specify configuration to sign the driver image
# Will be used when there is no pre-compiled driver image
# and operator is building + signing driver image in one shot within cluster
# necessary for secure boot enabled system
imageSign:
# the private key used to sign kernel modules within image
keySecret:
name: my-key-secret
# the public key used to sign kernel modules within image
certSecret:
name: my-cert-secret
## AMD K8s Device Plugin Configuration ##
devicePlugin:
# (Optional) Specifying image names are optional. Default image names for shown here if not specified.
devicePluginImage: rocm/k8s-device-plugin:latest # Change this to trigger metrics exporter upgrade on CR update
nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest # Change this to trigger metrics exporter upgrade on CR update
# (Optional) Specify image registry secret to pull device plugin and node labeller images if needed.
imageRegistrySecret:
name: my-deviceplugin-image-secret
# (Optional) Enable or disable node labeller, default value is true
enableNodeLabeller: true
## AMD GPU Metrics Exporter Configuration ##
metricsExporter:
enable: False # False by Default. Set to True to enable the Metrics Exporter
serviceType: ClusterIP # ServiceType used to expose the Metrics Exporter endpoint. Can be either `ClusterIp` or `NodePort`.
port: 5000 # Used to specify Port the Metrics Exporter service is exposed on when using ClusterIP serviceType
nodePort: 32500 # Used instead of `port` when using NodePort as the serviceType. The port number must be between 30000-32767
# (Optional) Specifying metrics exporter image is optional. Default imagename shown here if not specified.
image: rocm/device-metrics-exporter:latest # Change this to trigger metrics exporter upgrade on CR update
metricsExporter:
# Enable metrics collection and exposure (Default: false)
enable: False
# Service type for metrics endpoint exposure
# Values: ClusterIP, NodePort
# Default: ClusterIP
serviceType: ClusterIP
# Port for metrics endpoint when using ClusterIP
# Default: 5000
port: 5000
# Port for metrics endpoint when using NodePort
# Valid range: 30000-32767
# Default: 32500
nodePort: 32500
# Container image for metrics exporter
# Default: rocm/device-metrics-exporter:latest
image: rocm/device-metrics-exporter:latest
# Private registry credentials (optional)
imageRegistrySecret:
name: exporter-image-pull-secret
# Custom metrics exporter configuration (optional)
config:
name: exporter-configmap
# RBAC Proxy Configuration for secure metrics endpoint access (optional)
rbacConfig:
# Enable RBAC authentication proxy (Default: false)
# When enabled, provides authentication and authorization for metrics endpoint
enable: false
# RBAC proxy container image
# Default: quay.io/brancz/kube-rbac-proxy:v0.18.1
image: "quay.io/brancz/kube-rbac-proxy:v0.18.1"
# TLS configuration for metrics endpoint
# Set true to disable HTTPS
disableHttps: false
# TLS certificate configuration
# Default: Auto-generated self-signed certificates
secret:
name: my-kube-rbac-proxy-cert
# If specifying a node selector here, the metrics exporter will only be deployed on nodes that match the selector
# See Item #6 on https://dcgpu.docs.amd.com/projects/gpu-operator/en/latest/knownlimitations.html for example usage
selector:
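The diff view above drops leading whitespace, so the nesting of the new metricsExporter keys is hard to read. The fragment below is a sketch of how the added block would plausibly be indented, using only keys and values that appear in the hunk. The nesting is an assumption inferred from the comments (rbacConfig under metricsExporter, with disableHttps and secret inside rbacConfig); the flattened diff does not confirm it. The truncated selector entry is left out.

```yaml
spec:
  metricsExporter:
    enable: False               # default: false
    serviceType: ClusterIP      # ClusterIP or NodePort
    port: 5000                  # used with the ClusterIP service type
    nodePort: 32500             # used with NodePort; valid range 30000-32767
    image: rocm/device-metrics-exporter:latest
    imageRegistrySecret:        # optional
      name: exporter-image-pull-secret
    config:                     # optional
      name: exporter-configmap
    rbacConfig:                 # optional; assumed to sit under metricsExporter
      enable: false
      image: "quay.io/brancz/kube-rbac-proxy:v0.18.1"
      disableHttps: false       # assumed child of rbacConfig
      secret:                   # assumed child of rbacConfig
        name: my-kube-rbac-proxy-cert
```

With serviceType left at ClusterIP, only port applies. Per the comments in the diff, nodePort is read only when serviceType is switched to NodePort.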
