Added Release Notes, Known Limitations and Contributing section to docs (#31)

* Added Release Notes, Known Limitations and Contributing section to docs

* Fixed Spelling

* Fixed reference to Exporter example config.json

---------

Co-authored-by: Farshad Ghodsian <[email protected]>
farshadghodsian and Farshad Ghodsian authored Dec 23, 2024
1 parent f7e96b8 commit db232ec
Showing 10 changed files with 540 additions and 8 deletions.
3 changes: 3 additions & 0 deletions .wordlist.txt
Original file line number Diff line number Diff line change
@@ -52,3 +52,6 @@ verison
webhook
CRD
uninstallation
OpenShift
Autobuild
NMC
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -5,8 +5,8 @@
external_projects = ["amd-gpu-operator"]
external_projects_current_project = "amd-gpu-operator"

-project = "AMD Instinct Hub"
-version = "1.0.0"
+project = "AMD Instinct Documentation"
+version = "1.1.0"
release = version
html_title = f"AMD GPU Operator {version}"
author = "Advanced Micro Devices, Inc."
53 changes: 53 additions & 0 deletions docs/contributing/documentation-build-guide.md
@@ -0,0 +1,53 @@
# Documentation Build Guide

This guide is for developers who want to contribute to the AMD GPU Operator documentation available at https://dcgpu.docs.amd.com/projects/gpu-operator. The docs use [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) as their base. The steps below show how to build and serve the docs locally for testing.

## Building and Serving the Docs

1. Create a Python Virtual Environment (optional, but recommended)

```bash
python3 -m venv .venv/docs
source .venv/docs/bin/activate  # on Windows: source .venv/docs/Scripts/activate
```

2. Install required packages for docs

```bash
pip install -r docs/sphinx/requirements.txt
```

3. Build the docs

```bash
python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
```

4. Serve docs locally on port 8000

```bash
python3 -m http.server -d ./docs/_build/html/
```

5. You can now view the docs site by going to http://localhost:8000

## Auto-building the docs

The following steps watch the docs directory and rebuild the documentation each time you change a documentation file:

1. Install Sphinx Autobuild package

```bash
pip install sphinx-autobuild
```

2. Run the autobuild (will also serve the docs on port 8000 automatically)

```bash
sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml"
```

## Troubleshooting

1. **Navigation Menu not displaying new links**

If you've recently added a new link to the navigation menu, previously unchanged pages may not display the new link correctly. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu is rebuilt for all pages.
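
The clean rebuild described above can be sketched as a small shell helper. The function name `clean_rebuild_docs` is hypothetical, and the sketch assumes you run it from the repository root so that the paths match the build command shown earlier in this guide.

```bash
# Hypothetical helper: remove cached build output, then rebuild from scratch.
# Assumes it is run from the repository root, as in the build steps above.
clean_rebuild_docs() {
  # The build command writes doctrees to _build/ and HTML to docs/_build/,
  # so both directories are removed before rebuilding.
  rm -rf _build docs/_build
  python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
}
```

Call `clean_rebuild_docs` after changing the navigation menu, then serve the output as in the steps above.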
207 changes: 207 additions & 0 deletions docs/contributing/documentation-standards.md
@@ -0,0 +1,207 @@
# Documentation Standards

## Voice and Tone

### Writing Style

- Use active voice
- Write in second person ("you") for instructions
- Maintain professional, technical tone
- Be concise and direct
- Use present tense

Examples:

```diff
- The configuration file will be created by the operator
+ The operator creates the configuration file

- One should ensure that all prerequisites are met
+ Ensure all prerequisites are met
```

### Terminology Standards

#### Product Names

- "AMD GPU Operator" (not "GPU operator" or "gpu-operator")
- "Kubernetes" (not "kubernetes" or "K8s")
- "OpenShift" (not "Openshift" or "openshift")
- "AMD ROCm™" (not "ROCM" or "rocm")

#### Technical Terms

| Term | Usage Notes |
|------|-------------|
| AMD GPU driver | Standard term for the driver. Don't use "AMDGPU driver" or "GPU driver" alone |
| worker node | Standard term for cluster nodes. Don't use "worker" or "node" alone |
| DeviceConfig | One word, capital 'D' and 'C' when referring to the resource |
| container image | Use instead of just "image" |
| pod | Lowercase unless starting a sentence |
| namespace | Lowercase unless starting a sentence |

#### Acronym Usage

Always expand acronyms on first use in each document:

- NFD (Node Feature Discovery)
- KMM (Kernel Module Management)
- CRD (Custom Resource Definition)
- CR (Custom Resource)
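
Some of the product-name rules above can be spot-checked with `grep`. This is a minimal sketch, not part of the project's tooling: the helper name `find_term_violations` is hypothetical, and the pattern covers only a few of the discouraged spellings.

```bash
# Hypothetical helper: list lines under a directory that use a discouraged
# product-name spelling. The pattern is illustrative, not exhaustive.
find_term_violations() {
  # -r recurse, -n show line numbers, -w match whole words only,
  # -E extended regular expressions; matching is case-sensitive.
  grep -rnwE 'Openshift|ROCM|K8s' "$1"
}
```

For example, `find_term_violations docs` prints each matching file, line number, and line. Expect false positives in code blocks and in this standards page itself, which quotes the discouraged spellings.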

## Formatting Standards

### Headers

- Use title case for all headers
- Add blank line before and after headers

```markdown
# Main Title

## Section Title

### Subsection Title
```

### Code Blocks

- Always specify language for syntax highlighting
- Use inline code format (`code`) for:
  - Command names
  - File names
  - Variable names
  - Resource names
- Use block code format (```) for:
  - Command examples
  - YAML/JSON examples
  - Configuration files
  - Output examples

Examples:

````markdown
Install using `helm`:

```bash
helm install amd-gpu-operator rocm/gpu-operator-helm
```

Create a configuration:

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: example
```
````

### Lists

- Maintain consistent indentation (2 spaces)
- End each list item with punctuation
- Add blank line between list items if they contain multiple sentences or code blocks

### Admonitions

Use consistent formatting for notes, warnings, and tips:

````markdown
```{note}
Important supplementary information.
```

```{warning}
Critical information about potential problems.
```

```{tip}
Helpful advice for better usage.
```
````

### Tables

- Use tables for structured information
- Include header row
- Align columns consistently
- Add blank lines before and after tables

Example:

```markdown
| Parameter | Description | Default |
|-----------|-------------|---------|
| `image` | Container image path | `rocm/gpu-operator:latest` |
| `version` | Driver version | `6.2.0` |
```

## Document Structure

### Standard Sections

Every document should include these sections in order:

1. Title (H1)
2. Brief overview/introduction
3. Prerequisites (if applicable)
4. Main content
5. Verification steps (if applicable)
6. Troubleshooting (if applicable)

### Example Template

```markdown
# Feature Title

Brief description of the feature or component.

## Prerequisites

- Required components
- Required permissions
- Required resources

## Overview

Detailed description of the feature.

## Configuration

Configuration steps and examples.

## Verification

Steps to verify successful implementation.

## Troubleshooting

Common issues and solutions.
```

## File Naming

- Use lowercase
- Use hyphens for spaces
- Be descriptive but concise
- Include category prefix when applicable

Examples:

- `install-kubernetes.md`
- `upgrade-operator.md`
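
The lowercase and hyphen rules above can be expressed as a regular expression. The sketch below is one possible check, not an official linter: the helper name `is_valid_doc_name` is hypothetical, and allowing digits and requiring a `.md` extension are assumptions.

```bash
# Hypothetical helper: succeed only if a file name is lowercase,
# hyphen-separated, and ends in .md (digits are allowed by assumption).
is_valid_doc_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*\.md$'
}
```

For example, `is_valid_doc_name "install-kubernetes.md"` returns success, while `is_valid_doc_name "Install_Kubernetes.md"` returns failure.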

## Links and References

- Use relative links for internal documentation
- Use absolute links for external references
- Include link text that makes sense out of context

Examples:

```markdown
[Installation Guide](../install/kubernetes)
[Kubernetes Documentation](https://kubernetes.io/docs)
```
79 changes: 79 additions & 0 deletions docs/knownlimitations.md
@@ -0,0 +1,79 @@
# Known Issues and Limitations

1. **GPU operator driver installs only DKMS package**
- ***Impact:*** Applications that require ROCm packages will need to install the respective packages.
- ***Affected Configurations:*** All configurations
- ***Workaround:*** None as this is the intended behaviour
</br></br>

2. **When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install**
- ***Impact:*** The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg
- ***Affected Configurations:*** Nodes with driver version >= ROCm 6.2.x
- ***Workaround:*** Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+
</br></br>

3. **GPU Operator unable to install amdgpu driver if existing driver is already installed**
- ***Impact:*** Driver install will fail if the amdgpu in-box driver is present/already installed
- ***Affected Configurations:*** All configurations
- ***Workaround:*** When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted or amdgpu drivers should not be pre-installed on the node. [Blacklist in-box driver](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded or remove the pre-installed driver
</br></br>

4. **When GPU Operator is used in SKIP driver install mode, if the amdgpu module is removed while the device plugin is installed, the device plugin will not reflect the active GPUs available on the server**
- ***Impact:*** Workload scheduling is affected because workloads may be scheduled on nodes that do not have an active GPU.
- ***Affected Configurations:*** All configurations
- ***Workaround:*** Restart the deployed device plugin pod.
</br></br>

5. **Worker nodes whose kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed**
- ***Impact:*** Node upgrade will not proceed automatically and requires manual intervention
- ***Affected Configurations:*** All configurations
- ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:

```bash
kubectl cordon <node-name>
```

</br>

6. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module**

- ***Impact:*** Driver upgrade is blocked
- ***Affected Configurations:*** All configurations
- ***Workaround:*** Disable the Metrics Exporter on a specific node to allow the driver upgrade as follows:

1. Label all nodes with new label:

```bash
kubectl label nodes --all amd.com/device-metrics-exporter=true
```

2. Patch DeviceConfig to include new selectors for metrics exporter:

```bash
kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'
```

3. Remove the amd.com/device-metrics-exporter label for the specific node you would like to disable the exporter on:

```bash
kubectl label node [node-to-exclude] amd.com/device-metrics-exporter-
```

</br>

7. **Due to an issue with KMM 2.2, deletion of the DeviceConfig Custom Resource gets stuck in Red Hat OpenShift**

- ***Impact:*** Not able to delete the DeviceConfig Custom Resource if the node reboots during uninstall.
- ***Affected Configurations:*** This issue only affects Red Hat OpenShift
- ***Workaround:*** This issue will be fixed in the next release of KMM. For the time being, you can use a previous version of KMM instead of 2.2 or manually remove the status from the NMC:

1. List all the NMC resources and pick the correct NMC (there is one NMC per node, named the same as the node it relates to).

```bash
oc get nmc -A
```

2. Edit the NMC.

```bash
oc edit nmc <nmc name>
```

3. Remove all the data related to your module from the NMC status and save. That should allow the module to be deleted.
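
The DeviceConfig patch in the Metrics Exporter workaround above is easy to mis-quote because the JSON payload itself contains double quotes. One way to reduce that risk, sketched here, is to keep the payload in a single-quoted shell variable. The names `exporter_patch` and `patch_exporter_selector` are hypothetical; the resource name, namespace, and selector are copied from that workaround.

```bash
# Keep the merge patch in one single-quoted string so the shell never
# interprets the double quotes inside the JSON.
exporter_patch='{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}'

# Hypothetical helper: apply the patch to the DeviceConfig from the workaround.
patch_exporter_selector() {
  kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type=merge -p "$exporter_patch"
}
```

Run `patch_exporter_selector` only after labeling the nodes, as in step 1 of that workaround.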