From db232ec4cec477f266352ca08566122f6afac308 Mon Sep 17 00:00:00 2001 From: Farshad Ghodsian <47931571+farshadghodsian@users.noreply.github.com> Date: Mon, 23 Dec 2024 18:28:04 -0500 Subject: [PATCH] Added Release Notes, Known Limitations and Contributing section to docs (#31) * Added Release Notes, Known Limitations and Contributing section to docs * Fixed Spelling * Fixed reference to Exporter example config.json --------- Co-authored-by: Farshad Ghodsian --- .wordlist.txt | 3 + docs/conf.py | 4 +- .../contributing/documentation-build-guide.md | 53 +++++ docs/contributing/documentation-standards.md | 207 ++++++++++++++++++ docs/knownlimitations.md | 79 +++++++ docs/metrics/exporter.md | 16 +- docs/releasenotes.md | 158 +++++++++++++ docs/site/metrics/exporter/index.html | 2 +- docs/sphinx/_toc.yml | 13 +- docs/sphinx/_toc.yml.in | 13 +- 10 files changed, 540 insertions(+), 8 deletions(-) create mode 100644 docs/contributing/documentation-build-guide.md create mode 100644 docs/contributing/documentation-standards.md create mode 100644 docs/knownlimitations.md create mode 100644 docs/releasenotes.md diff --git a/.wordlist.txt b/.wordlist.txt index 6a3f3ef..76490f7 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -52,3 +52,6 @@ verison webhook CRD uninstallation +OpenShift +Autobuild +NMC \ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py index 42ac373..bce0355 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -5,8 +5,8 @@ external_projects = ["amd-gpu-operator"] external_projects_current_project = "amd-gpu-operator" -project = "AMD Instinct Hub" -version = "1.0.0" +project = "AMD Instinct Documentation" +version = "1.1.0" release = version html_title = f"AMD GPU Operator {version}" author = "Advanced Micro Devices, Inc." diff --git a/docs/contributing/documentation-build-guide.md b/docs/contributing/documentation-build-guide.md new file mode 100644 index 0000000..b9c9c8c --- /dev/null +++ b/docs/contributing/documentation-build-guide.md @@ -0,0 +1,53 @@ +# Documentation Build Guide + +This guide provides information for developers who want to contribute to the AMD GPU Operator documentation available at https://dcgpu.docs.amd.com/projects/gpu-operator. The docs use [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) as their base and the below guide will show how you can build and serve the docs locally for testing. + +## Building and Serving the Docs + +1. Create a Python Virtual Environment (optional, but recommended) + + ```bash + python3 -m venv .venv/docs + source .venv/docs/bin/activate (or source .venv/docs/Scripts/activate on Windows) + ``` + +2. Install required packages for docs + + ```bash + pip install -r docs/sphinx/requirements.txt + ``` + +3. Build the docs + + ```bash + python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html + ``` + +4. Serve docs locally on port 8000 + + ```bash + python3 -m http.server -d ./docs/_build/html/ + ``` + +5. You can now view the docs site by going to http://localhost:8000 + +## Auto-building the docs +The below will allow you to watch the docs directory and rebuild the documenatation each time you make a change to the documentation files: + +1. Install Sphinx Autobuild package + + ```bash + pip install sphinx-autobuild + ``` + +2. Run the autobuild (will also serve the docs on port 8000 automatically) + + ```bash + sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml" + ``` + +## Troubleshooting + +1. 
**Navigation Menu not displaying new links** + + Note that if you've recently added a new link to the navigation menu previously unchanged pages may not correctly display the new link. To fix this delete the existing `_build/` directory and rebuild the docs so that the navigation menu will be rebuilt for all pages. diff --git a/docs/contributing/documentation-standards.md b/docs/contributing/documentation-standards.md new file mode 100644 index 0000000..1482bfd --- /dev/null +++ b/docs/contributing/documentation-standards.md @@ -0,0 +1,207 @@ +# Documentation Standards + +## Voice and Tone + +### Writing Style + +- Use active voice +- Write in second person ("you") for instructions +- Maintain professional, technical tone +- Be concise and direct +- Use present tense + +Examples: + +```diff +- The configuration file will be created by the operator ++ The operator creates the configuration file + +- One should ensure that all prerequisites are met ++ Ensure all prerequisites are met +``` + +### Terminology Standards + +#### Product Names + +- "AMD GPU Operator" (not "GPU operator" or "gpu-operator") +- "Kubernetes" (not "kubernetes" or "K8s") +- "OpenShift" (not "Openshift" or "openshift") +- "AMD ROCm™" (not "ROCM" or "rocm") + +#### Technical Terms + +| Term | Usage Notes | +|------|-------------| +| AMD GPU driver | Standard term for the driver. Don't use "AMDGPU driver" or "GPU driver" alone | +| worker node | Standard term for cluster nodes. Don't use "worker" or "node" alone | +| DeviceConfig | One word, capital 'D' and 'C' when referring to the resource | +| container image | Use instead of just "image" | +| pod | Lowercase unless starting a sentence | +| namespace | Lowercase unless starting a sentence | + +#### Acronym Usage + +Always expand acronyms on first use in each document: + +- NFD (Node Feature Discovery) +- KMM (Kernel Module Management) +- CRD (Custom Resource Definition) +- CR (Custom Resource) + +## Formatting Standards + +### Headers + +- Use title case for all headers +- Add blank line before and after headers + +```markdown +# Main Title + +## Section Title + +### Subsection Title +``` + +### Code Blocks + +- Always specify language for syntax highlighting +- Use inline code format (`code`) for: + - Command names + - File names + - Variable names + - Resource names +- Use block code format (```) for: + - Command examples + - YAML/JSON examples + - Configuration files + - Output examples + +Examples: + +````markdown +Install using `helm`: + +```bash +helm install amd-gpu-operator rocm/gpu-operator-helm +``` + +Create a configuration: + +```yaml +apiVersion: amd.com/v1alpha1 +kind: DeviceConfig +metadata: + name: example +``` +```` + +### Lists + +- Maintain consistent indentation (2 spaces) +- End each list item with punctuation +- Add blank line between list items if they contain multiple sentences or code blocks + +### Admonitions + +Use consistent formatting for notes, warnings, and tips: + +```markdown +```{note} +Important supplementary information. +``` + +```{warning} +Critical information about potential problems. +``` + +```{tip} +Helpful advice for better usage. 
+``` + +```text + +### Tables + +- Use tables for structured information +- Include header row +- Align columns consistently +- Add blank lines before and after tables + +Example: + +```markdown +| Parameter | Description | Default | +|-----------|-------------|---------| +| `image` | Container image path | `rocm/gpu-operator:latest` | +| `version` | Driver version | `6.2.0` | +``` + +## Document Structure + +### Standard Sections + +Every document should include these sections in order: + +1. Title (H1) +2. Brief overview/introduction +3. Prerequisites (if applicable) +4. Main content +5. Verification steps (if applicable) +6. Troubleshooting (if applicable) + +### Example Template + +```markdown +# Feature Title + +Brief description of the feature or component. + +## Prerequisites + +- Required components +- Required permissions +- Required resources + +## Overview + +Detailed description of the feature. + +## Configuration + +Configuration steps and examples. + +## Verification + +Steps to verify successful implementation. + +## Troubleshooting + +Common issues and solutions. +``` + +## File Naming + +- Use lowercase +- Use hyphens for spaces +- Be descriptive but concise +- Include category prefix when applicable + +Examples: + +- `install-kubernetes.md` +- `upgrade-operator.md` + +## Links and References + +- Use relative links for internal documentation +- Use absolute links for external references +- Include link text that makes sense out of context + +Examples: + +```markdown +[Installation Guide](../install/kubernetes) +[Kubernetes Documentation](https://kubernetes.io/docs) +``` diff --git a/docs/knownlimitations.md b/docs/knownlimitations.md new file mode 100644 index 0000000..1e7ea73 --- /dev/null +++ b/docs/knownlimitations.md @@ -0,0 +1,79 @@ +# Known Issues and Limitations + +1. **GPU operator driver installs only DKMS package** + - *****Impact:***** Applications which require ROCM packages will need to install respective packages. + - ***Affected Configurations:*** All configurations + - ***Workaround:*** None as this is the intended behaviour +
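   For example, a workload that needs ROCm user-space libraries can bring them in its own container image while requesting a GPU through the operator's `amd.com/gpu` resource. A minimal sketch, assuming a ROCm-enabled image such as `rocm/pytorch:latest` (use whichever image your workload actually needs):

   ```bash
   # Create a pod that supplies its own ROCm user space and requests one AMD GPU
   cat <<'EOF' | kubectl apply -f -
   apiVersion: v1
   kind: Pod
   metadata:
     name: rocm-smoke-test
   spec:
     restartPolicy: Never
     containers:
       - name: rocm
         image: rocm/pytorch:latest   # assumption: any image that ships the ROCm packages your app needs
         command: ["rocm-smi"]
         resources:
           limits:
             amd.com/gpu: 1
   EOF
   ```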

2. **When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install**
   - ***Impact:*** The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg.
   - ***Affected Configurations:*** Nodes with driver version >= ROCm 6.2.x
   - ***Workaround:*** Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+.
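   A minimal sketch of the manual reboot workaround, assuming a cluster-admin kubeconfig and SSH access to the node (the node name and SSH user are placeholders):

   ```bash
   # Drain the node so running workloads are rescheduled elsewhere before the reboot
   kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

   # Reboot the node (for example over SSH) so the new amdgpu driver can finish loading
   ssh <user>@<node-name> sudo reboot

   # Once the node is back, allow scheduling again
   kubectl uncordon <node-name>
   ```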

3. **GPU Operator unable to install the amdgpu driver if an existing driver is already installed**
   - ***Impact:*** The driver install will fail if the amdgpu in-box driver is present or already installed.
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** When installing the amdgpu driver using the GPU Operator, worker nodes should have amdgpu blacklisted, or the amdgpu driver should not be pre-installed on the node. [Blacklist the in-box driver](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded, or remove the pre-installed driver.
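   The linked page describes the operator-supported approach; as a manual sketch, blacklisting the in-box module directly on a worker node looks roughly like the following (the initramfs rebuild command is distribution-specific and is shown here for Ubuntu/Debian as an assumption):

   ```bash
   # Prevent the in-box amdgpu module from loading at boot
   echo "blacklist amdgpu" | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf

   # Rebuild the initramfs so the blacklist takes effect on the next boot (Ubuntu/Debian)
   sudo update-initramfs -u

   # Reboot for the change to apply
   sudo reboot
   ```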

4. **When the GPU Operator is used in skip driver install mode, removing the amdgpu module while the device plugin is installed will not reflect the active GPUs available on the server**
   - ***Impact:*** Workload scheduling is affected, as workloads may be scheduled on nodes that no longer have an active GPU.
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** Restart the deployed device plugin pod.
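   A minimal sketch of restarting the device plugin pod, assuming the operator runs in the `kube-amd-gpu` namespace used elsewhere in these docs and that the pod name contains `device-plugin` (verify the actual name with the first command):

   ```bash
   # Find the device plugin pod running on the affected node
   kubectl get pods -n kube-amd-gpu -o wide | grep device-plugin

   # Delete it; its DaemonSet recreates it and re-reports the available GPUs
   kubectl delete pod <device-plugin-pod-name> -n kube-amd-gpu
   ```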

5. **Worker nodes whose kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed**
   - ***Impact:*** The node upgrade will not proceed automatically and requires manual intervention.
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:

   ```bash
   kubectl cordon <node-name>
   ```
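   A sketch of the remaining steps, assuming you perform the kernel upgrade and reboot out of band (node name is a placeholder):

   ```bash
   # Evict remaining workloads before the kernel upgrade and reboot
   kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

   # ...upgrade the kernel and reboot the node...

   # Re-enable scheduling once the node reports Ready again
   kubectl uncordon <node-name>
   ```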
6. **When the GPU Operator is installed with the Exporter enabled, a driver upgrade is blocked because the exporter is actively using the amdgpu module**
   - ***Impact:*** Driver upgrade is blocked
   - ***Affected Configurations:*** All configurations
   - ***Workaround:*** Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows:

     1. Label all nodes with the new label:

        ```bash
        kubectl label nodes --all amd.com/device-metrics-exporter=true
        ```

     2. Patch the DeviceConfig to include the new selectors for the metrics exporter:

        ```bash
        kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'
        ```

     3. Remove the amd.com/device-metrics-exporter label from the specific node you would like to disable the exporter on:

        ```bash
        kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-
        ```
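   After removing the label, you can confirm the exporter pod is no longer scheduled on that node before starting the driver upgrade. A quick check, using the exporter label shown elsewhere in these docs:

   ```bash
   # The excluded node should no longer appear in this list
   kubectl get pods -n kube-amd-gpu -l "app.kubernetes.io/name=metrics-exporter" -o wide
   ```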
+ +7. **Due to issue with KMM 2.2 deletion of DeviceConfig Custom Resource gets stuck in Red Hat OpenShift** + - ***Impact:*** Not able to delete the DeviceConfig Custom Resource if the node reboots during uninstall. + - ***Affected Configurations:*** This issue only affects Red Hat OpenShift + - ***Workaround:*** This issue will be fixed in the next release of KMM. For the time being you can use a previous version of KMM aside from 2.2 or manually remove the status from NMC: + 1. List all the NMC resources and pick up the correct NMC (there is one nmc per node, named the same as the node it related to). + + ```bash + oc get nmc -A + ``` + + 2. Edit the NMC. + + ```bash + oc edit nmc + ``` + + 3. Remove from NMC status for all the data related to your module and save. That should allow the module to be finally deleted. diff --git a/docs/metrics/exporter.md b/docs/metrics/exporter.md index 905f220..ab1b3f0 100644 --- a/docs/metrics/exporter.md +++ b/docs/metrics/exporter.md @@ -2,29 +2,38 @@ ## Configure metrics exporter -configure ``` enable ``` field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter +To start the Device Metrics Exporter along with the GPU Operator configure the ``` spec/metricsExporter/enable ``` field in deviceconfig Custom Resource(CR) to enable/disable metrics exporter ```yaml # Specify the metrics exporter config metricsExporter: # To enable/disable the metrics exporter, disabled by default enable: True + # kubernetes service type for metrics exporter, clusterIP(default) or NodePort serviceType: "NodePort" + # Node port for metrics exporter service, metrics endpoint $node-ip:$nodePort nodePort: 32500 + # image for the metrics-exporter container image: "amd/device-metrics-exporter/exporter:v1" + ``` -**metrics-exporter** pods start after updating the **DeviceConfig** CR +The **metrics-exporter** pods start after updating the **DeviceConfig** CR ```bash #kubectl get pods -n kube-amd-gpu -l "app.kubernetes.io/name=metrics-exporter" NAME READY STATUS RESTARTS AGE -test-deviceconfig-metrics-exporter-q8hbb 1/1 Running 0 74s +gpu-operator-metrics-exporter-q8hbb 1/1 Running 0 74s ``` +
Note: The metrics exporter pod name is prefixed with the name of your DeviceConfig custom resource (`gpu-operator` in the default Helm installation).
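A quick way to check that the exporter is serving metrics, assuming the `NodePort` service type and node port 32500 from the example configuration above (replace the node IP with one of your GPU worker nodes):

```bash
# Fetch the first few exported metrics from a GPU node
curl -s http://<node-ip>:32500/metrics | head
```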

+ +## Metrics Exporter DeviceConfig | Field Name | Details | |----------------------------|----------------------------------------------| | **Enable** | Enable/Disable metrics exporter | @@ -36,6 +45,7 @@ test-deviceconfig-metrics-exporter-q8hbb 1/1 Running 0 74s | **config** | metrics configurations (fields/labels) | | | | | **name** | configmap name for custom fields/labels | +
## Customize metrics fields/labels

diff --git a/docs/releasenotes.md b/docs/releasenotes.md
new file mode 100644
index 0000000..dc6c22d
--- /dev/null
+++ b/docs/releasenotes.md
@@ -0,0 +1,158 @@
# GPU Operator v1.1.0 Release Notes

The GPU Operator v1.1.0 release adds support for Red Hat OpenShift versions 4.16 and 4.17. The AMD GPU Operator has gone through a rigorous validation process and is now *certified* for use on OpenShift. It can now be deployed via [the Red Hat Catalog](https://catalog.redhat.com/software/container-stacks/detail/6722781e65e61b6d4caccef8).

> **Note:** The latest AMD GPU Operator OLM Bundle for OpenShift is tagged with version v1.1.1, as the operator image has been updated to include a minor driver fix.

## Release Highlights

- The AMD GPU Operator is now certified for use with Red Hat OpenShift v4.16 and v4.17
- Updated documentation with installation and configuration steps for Red Hat OpenShift

## Platform Support

### New Platform Support

- **Red Hat OpenShift 4.16-4.17**
  - Supported features:
    - Driver management
    - Workload scheduling
    - Metrics monitoring
  - Requirements: Red Hat OpenShift version 4.16 or 4.17
+ +## Known Limitations + +1. **Due to issue with KMM 2.2 deletion of DeviceConfig Custom Resource gets stuck in Red Hat OpenShift** + - *Impact:* Not able to delete the DeviceConfig Custom Resource if the node reboots during uninstall. + - *Affected Configurations:* This issue only affects Red Hat OpenShift + - *Workaround:* This issue will be fixed in the next release of KMM. For the time being you can use a previous version of KMM aside from 2.2 or manually remove the status from NMC: + 1. List all the NMC resources and pick up the correct NMC (there is one nmc per node, named the same as the node it related to). + + ```bash + oc get nmc -A + ``` + + 2. Edit the NMC. + + ```bash + oc edit nmc + ``` + + 3. Remove from NMC status for all the data related to your module and save. That should allow the module to be finally deleted. + +
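      To see exactly which entries to remove, you can inspect the modules recorded for the node first. A hedged sketch, assuming the entries appear under `.status.modules` as in KMM's NodeModulesConfig API:

      ```bash
      # Show the namespace/name of each module entry recorded in the node's NMC status
      oc get nmc <node-name> -o jsonpath='{range .status.modules[*]}{.namespace}/{.name}{"\n"}{end}'
      ```

      Only the entries belonging to your DeviceConfig's module need to be removed when you edit the NMC.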

+ +# GPU Operator v1.0.0 Release Notes + +This release is the first major release of AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct™ GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications. + +## Release Highlights + +- Manage AMD GPU drivers with desired versions on Kubernetes cluster nodes +- Customized scheduling of AMD GPU workloads within Kubernetes cluster +- Metrics and statistics monitoring solution for AMD GPU hardware and workloads +- Support specialized networking environment like HTTP proxy or Air-gapped network + +## Hardware Support + +### New Hardware Support + +- **AMD Instinct™ MI300** + - Required driver version: ROCm 6.2+ + +- **AMD Instinct™ MI250** + - Required driver version: ROCm 6.2+ + +- **AMD Instinct™ MI210** + - Required driver version: ROCm 6.2+ + +## Platform Support + +### New Platform Support + +- **Kubernetes 1.29+** + - Supported features: + - Driver management + - Workload scheduling + - Metrics monitoring + - Requirements: Kubernetes version 1.29+ + +## Breaking Changes + +Not Applicable as this is the initial release. +
+ +## New Features + +### Feature Category + +- **Driver management** + - *Managed Driver Installations:* Users will be able to install ROCm 6.2+ dkms driver on Kubernetes worker nodes, they can also optionally choose to use inbox or pre-installed driver on the worker nodes + - *DeviceConfig Custom Resource:* Users can configure a new DeviceConfig CRD (Custom Resource Definition) to define the driver management behavior of the GPU Operator + +- **GPU Workload Scheduling** + - *Custom Resource Allocation "amd.com/gpu":* After the deployment of the GPU Operator a new custom resource allocation will be present on each GPU node, `amd.com/gpu`, which will list the allocatable GPU resources on the node for which GPU workloads can be scheduled against + - *Assign Multiple GPUs:* Users can easily specify the number of AMD GPUs required by each workload in the [deployment/pod spec](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/usage.html#creating-a-gpu-enabled-pod) and the Kubernetes scheduler wiill automatically take care of assigning the correct GPU resources + +- **Metrics Monitoring for GPUs and Workloads**: + - *Out-of-box Metrics:* Users can optionally enable the AMD Device Metrics Exporter when installing the AMD GPU Operator to enable a robust out-of-box monitoring solution for prometheus to consume + - *Custom Metrics Configurations:* Users can utilize a [configmap](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/metrics/exporter.html#configure-metrics-exporter) to customize the configuration and behavior of Device Metrics Exporter + +- **Specialized Network Setups**: + - *Air-gapped Installation:* Users can install the GPU Operator in a secure [air-gapped environment](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/specialized_networks/airgapped-install.html) where the Kubernetes cluster has no external network connectivity + - *HTTP Proxy Support:* The AMD GPU Operator supports usage within a Kubernetes cluster that is behind an [HTTP Proxy](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/specialized_networks/http-proxy.html). Support for HTTPS Proxy will be added in a future version of the GPU Operator. + +## Known Limitations + +1. **GPU operator driver installs only DKMS package** + - *Impact:* Applications which require ROCM packages will need to install respective packages. + - *Affected Configurations:* All configurations + - *Workaround:* None as this is the intended behaviour + +2. **When Using Operator to install amdgpu 6.1.3/6.2 a reboot is required to complete install** + - *Impact:* Node requires a reboot when upgrade is initiated due to ROCm bug. Driver install failures may be seen in dmesg + - *Affected configurations:* Nodes with driver version >= ROCm 6.2.x + - *Workaround:* Reboot the nodes upgraded manually to finish the driver install. This has been fixed in ROCm 6.3+ + +3. **GPU Operator unable to install amdgpu driver if existing driver is already installed** + - *Impact:* Driver install will fail if amdgpu in-box Driver is present/already installed + - *Affected Configurations:* All configurations + - *Workaround:* When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted or amdgpu drivers should not be pre-installed on the node. [Blacklist in-box driver](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded or remove the pre-installed driver + +4. 
**When GPU Operator is used in skip driver install mode, removing the amdgpu module while the device plugin is installed will not reflect the active GPUs available on the server**
   - *Impact:* Workload scheduling is affected, as workloads may be scheduled on nodes that no longer have an active GPU.
   - *Affected Configurations:* All configurations
   - *Workaround:* Restart the deployed device plugin pod.

5. **Worker nodes whose kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed**
   - *Impact:* The node upgrade will not proceed automatically and requires manual intervention.
   - *Affected Configurations:* All configurations
   - *Workaround:* Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:

   ```bash
   kubectl cordon <node-name>
   ```

6. **When GPU Operator is installed with the Exporter enabled, a driver upgrade is blocked because the exporter is actively using the amdgpu module**
   - *Impact:* Driver upgrade is blocked
   - *Affected Configurations:* All configurations
   - *Workaround:* Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows:

     1. Label all nodes with the new label:

        ```bash
        kubectl label nodes --all amd.com/device-metrics-exporter=true
        ```

     2. Patch the DeviceConfig to include the new selectors for the metrics exporter:

        ```bash
        kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'
        ```

     3. Remove the amd.com/device-metrics-exporter label from the specific node you would like to disable the exporter on:

        ```bash
        kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-
        ```

diff --git a/docs/site/metrics/exporter/index.html b/docs/site/metrics/exporter/index.html
index 0465158..e5d42ec 100644
--- a/docs/site/metrics/exporter/index.html
+++ b/docs/site/metrics/exporter/index.html
@@ -179,7 +179,7 @@

Configure metrics exporter

serviceType -service type for metrics, clusterIP/Nodeport +service type for metrics, clusterIP/NodePort nodePort diff --git a/docs/sphinx/_toc.yml b/docs/sphinx/_toc.yml index e70d92a..825614f 100644 --- a/docs/sphinx/_toc.yml +++ b/docs/sphinx/_toc.yml @@ -6,11 +6,17 @@ subtrees: - caption: Usage entries: - file: overview + - file: releasenotes + title: Release Notes - file: usage - - file: troubleshooting + - file: troubleshooting + - file: knownlimitations + title: Known Limitations - caption: Installation entries: - file: installation/kubernetes-helm + - file: installation/openshift-helm + - file: installation/openshift-olm - file: uninstallation/uninstallation - caption: Driver Management entries: @@ -27,3 +33,8 @@ subtrees: entries: - file: specialized_networks/airgapped-install - file: specialized_networks/http-proxy + - caption: Contributing + entries: + - file: contributing/documentation-build-guide + - file: contributing/documentation-standards + diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index e70d92a..825614f 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -6,11 +6,17 @@ subtrees: - caption: Usage entries: - file: overview + - file: releasenotes + title: Release Notes - file: usage - - file: troubleshooting + - file: troubleshooting + - file: knownlimitations + title: Known Limitations - caption: Installation entries: - file: installation/kubernetes-helm + - file: installation/openshift-helm + - file: installation/openshift-olm - file: uninstallation/uninstallation - caption: Driver Management entries: @@ -27,3 +33,8 @@ subtrees: entries: - file: specialized_networks/airgapped-install - file: specialized_networks/http-proxy + - caption: Contributing + entries: + - file: contributing/documentation-build-guide + - file: contributing/documentation-standards +