Added Release Notes, Known Limitations and Contributing section to docs (#31)
* Added Release Notes, Known Limitations and Contributing section to docs
* Fixed Spelling
* Fixed reference to Exporter example config.json

Co-authored-by: Farshad Ghodsian <[email protected]>
1 parent f7e96b8, commit db232ec
Showing 10 changed files with 540 additions and 8 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -52,3 +52,6 @@ verison | |
webhook | ||
CRD | ||
uninstallation | ||
OpenShift | ||
Autobuild | ||
NMC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
# Documentation Build Guide | ||
|
||
This guide is for developers who want to contribute to the AMD GPU Operator documentation available at https://dcgpu.docs.amd.com/projects/gpu-operator. The docs use [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) as their base. The steps below show how to build and serve the docs locally for testing. | ||
|
||
## Building and Serving the Docs | ||
|
||
1. Create a Python Virtual Environment (optional, but recommended) | ||
|
||
```bash | ||
python3 -m venv .venv/docs | ||
source .venv/docs/bin/activate  # on Windows: source .venv/docs/Scripts/activate | ||
``` | ||
|
||
2. Install required packages for docs | ||
|
||
```bash | ||
pip install -r docs/sphinx/requirements.txt | ||
``` | ||
|
||
3. Build the docs | ||
|
||
```bash | ||
python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html | ||
``` | ||
|
||
4. Serve docs locally on port 8000 | ||
|
||
```bash | ||
python3 -m http.server -d ./docs/_build/html/ | ||
``` | ||
|
||
5. You can now view the docs site by going to http://localhost:8000 | ||
|
||
## Auto-building the docs | ||
Check failure on line 34 in docs/contributing/documentation-build-guide.md (GitHub Actions / Documentation / Markdown): Headings should be surrounded by blank lines
|
||
The following steps watch the docs directory and rebuild the documentation each time you change a documentation file: | ||
|
||
1. Install Sphinx Autobuild package | ||
|
||
```bash | ||
pip install sphinx-autobuild | ||
``` | ||
|
||
2. Run the autobuild (will also serve the docs on port 8000 automatically) | ||
|
||
```bash | ||
sphinx-autobuild -b html -d _build/doctrees -D language=en ./docs docs/_build/html --ignore "docs/_build/*" --ignore "docs/sphinx/_toc.yml" | ||
``` | ||
|
||
## Troubleshooting | ||
|
||
1. **Navigation Menu not displaying new links** | ||
|
||
Note that if you've recently added a new link to the navigation menu, previously unchanged pages may not display the new link. To fix this, delete the existing `_build/` directory and rebuild the docs so that the navigation menu is regenerated for all pages. |
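
The clean rebuild described above can be sketched as a small shell helper. This is only an illustration: `clean_docs_build` is a hypothetical name, and it assumes you run it from the repository root and that the build used the paths from the commands earlier in this guide (`_build/doctrees` and `docs/_build/html`).

```bash
# Sketch: remove cached Sphinx output so that every page is regenerated.
# Assumption: run from the repository root; the paths match the build
# commands shown earlier in this guide.
clean_docs_build() {
  rm -rf _build docs/_build
}

# Afterwards, rebuild with the same build command as before, for example:
# clean_docs_build && python3 -m sphinx -b html -d _build/doctrees -D language=en ./docs/ docs/_build/html
```

Only build output is removed, so the documentation sources under `docs/` are left untouched.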
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,207 @@ | ||
# Documentation Standards | ||
|
||
## Voice and Tone | ||
|
||
### Writing Style | ||
|
||
- Use active voice | ||
- Write in second person ("you") for instructions | ||
- Maintain professional, technical tone | ||
- Be concise and direct | ||
- Use present tense | ||
|
||
Examples: | ||
|
||
```diff | ||
- The configuration file will be created by the operator | ||
+ The operator creates the configuration file | ||
|
||
- One should ensure that all prerequisites are met | ||
+ Ensure all prerequisites are met | ||
``` | ||
|
||
### Terminology Standards | ||
|
||
#### Product Names | ||
|
||
- "AMD GPU Operator" (not "GPU operator" or "gpu-operator") | ||
- "Kubernetes" (not "kubernetes" or "K8s") | ||
- "OpenShift" (not "Openshift" or "openshift") | ||
- "AMD ROCm™" (not "ROCM" or "rocm") | ||
|
||
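A quick way to spot some of the discouraged spellings is a search over the Markdown sources. The sketch below is only an illustration: `check_product_names` is a hypothetical helper name, the pattern covers just three of the variants listed above, and it assumes the docs live under `docs/`. It also matches inside code blocks and URLs, so treat each hit as a candidate for review.

```bash
# Hypothetical helper: print lines in *.md files that use a discouraged
# product-name spelling. The pattern is deliberately small; extend it with
# care, because broader patterns (for example lowercase "kubernetes") also
# match URLs and command names.
check_product_names() {
  find "${1:-docs}" -name '*.md' -exec grep -nHE 'Openshift|K8s|ROCM' {} +
}
```

For example, `check_product_names docs` prints each matching line as `file:line:text`.
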
#### Technical Terms | ||
|
||
| Term | Usage Notes | | ||
|------|-------------| | ||
| AMD GPU driver | Standard term for the driver. Don't use "AMDGPU driver" or "GPU driver" alone | | ||
| worker node | Standard term for cluster nodes. Don't use "worker" or "node" alone | | ||
| DeviceConfig | One word, capital 'D' and 'C' when referring to the resource | | ||
| container image | Use instead of just "image" | | ||
| pod | Lowercase unless starting a sentence | | ||
| namespace | Lowercase unless starting a sentence | | ||
|
||
#### Acronym Usage | ||
|
||
Always expand acronyms on first use in each document: | ||
|
||
- NFD (Node Feature Discovery) | ||
- KMM (Kernel Module Management) | ||
- CRD (Custom Resource Definition) | ||
- CR (Custom Resource) | ||
|
||
## Formatting Standards | ||
|
||
### Headers | ||
|
||
- Use title case for all headers | ||
- Add blank line before and after headers | ||
|
||
```markdown | ||
# Main Title | ||
|
||
## Section Title | ||
|
||
### Subsection Title | ||
``` | ||
|
||
### Code Blocks | ||
|
||
- Always specify language for syntax highlighting | ||
- Use inline code format (`code`) for: | ||
- Command names | ||
- File names | ||
- Variable names | ||
- Resource names | ||
- Use block code format (```) for: | ||
- Command examples | ||
- YAML/JSON examples | ||
- Configuration files | ||
- Output examples | ||
|
||
Examples: | ||
|
||
````markdown | ||
Install using `helm`: | ||
|
||
```bash | ||
helm install amd-gpu-operator rocm/gpu-operator-helm | ||
``` | ||
|
||
Create a configuration: | ||
|
||
```yaml | ||
apiVersion: amd.com/v1alpha1 | ||
kind: DeviceConfig | ||
metadata: | ||
name: example | ||
``` | ||
```` | ||
|
||
### Lists | ||
|
||
- Maintain consistent indentation (2 spaces) | ||
- End each list item with punctuation | ||
- Add blank line between list items if they contain multiple sentences or code blocks | ||
|
||
### Admonitions | ||
|
||
Use consistent formatting for notes, warnings, and tips: | ||
|
||
````markdown | ||
```{note} | ||
Important supplementary information. | ||
``` | ||
|
||
```{warning} | ||
Critical information about potential problems. | ||
``` | ||
|
||
```{tip} | ||
Helpful advice for better usage. | ||
``` | ||
```` | ||
|
||
### Tables | ||
|
||
- Use tables for structured information | ||
- Include header row | ||
- Align columns consistently | ||
- Add blank lines before and after tables | ||
|
||
Example: | ||
|
||
```markdown | ||
| Parameter | Description | Default | | ||
|-----------|-------------|---------| | ||
| `image` | Container image path | `rocm/gpu-operator:latest` | | ||
| `version` | Driver version | `6.2.0` | | ||
``` | ||
|
||
## Document Structure | ||
|
||
### Standard Sections | ||
|
||
Every document should include these sections in order: | ||
|
||
1. Title (H1) | ||
2. Brief overview/introduction | ||
3. Prerequisites (if applicable) | ||
4. Main content | ||
5. Verification steps (if applicable) | ||
6. Troubleshooting (if applicable) | ||
|
||
### Example Template | ||
|
||
```markdown | ||
# Feature Title | ||
|
||
Brief description of the feature or component. | ||
|
||
## Prerequisites | ||
|
||
- Required components | ||
- Required permissions | ||
- Required resources | ||
|
||
## Overview | ||
|
||
Detailed description of the feature. | ||
|
||
## Configuration | ||
|
||
Configuration steps and examples. | ||
|
||
## Verification | ||
|
||
Steps to verify successful implementation. | ||
|
||
## Troubleshooting | ||
|
||
Common issues and solutions. | ||
``` | ||
|
||
## File Naming | ||
|
||
- Use lowercase | ||
- Use hyphens for spaces | ||
- Be descriptive but concise | ||
- Include category prefix when applicable | ||
|
||
Examples: | ||
|
||
- `install-kubernetes.md` | ||
- `upgrade-operator.md` | ||
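
The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the project's tooling: `is_valid_doc_name` and `check_doc_names` are hypothetical helper names, and the regular expression is one possible reading of "lowercase, hyphens for spaces". Conventional names such as `README.md` would be reported too, so review the output rather than treating it as authoritative.

```bash
# Hypothetical check: succeed only for lowercase, hyphen-separated .md names.
is_valid_doc_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(-[a-z0-9]+)*\.md$'
}

# Report every Markdown file under the given directory (default: docs) whose
# name breaks the convention. Assumes file names contain no newlines.
check_doc_names() {
  find "${1:-docs}" -name '*.md' | while IFS= read -r path; do
    is_valid_doc_name "$(basename "$path")" || echo "non-conforming: $path"
  done
}
```

Running `check_doc_names docs` from the repository root lists the files that would need renaming under this reading of the rules.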
|
||
## Links and References | ||
|
||
- Use relative links for internal documentation | ||
- Use absolute links for external references | ||
- Include link text that makes sense out of context | ||
|
||
Examples: | ||
|
||
```markdown | ||
[Installation Guide](../install/kubernetes) | ||
[Kubernetes Documentation](https://kubernetes.io/docs) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
# Known Issues and Limitations | ||
|
||
1. **GPU operator driver installs only DKMS package** | ||
- ***Impact:*** Applications that require ROCm packages must install the respective packages themselves. | ||
- ***Affected Configurations:*** All configurations | ||
- ***Workaround:*** None as this is the intended behaviour | ||
</br></br> | ||
|
||
2. **When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install** | ||
- ***Impact:*** The node requires a reboot when an upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg. | ||
- ***Affected Configurations:*** Nodes with driver version >= ROCm 6.2.x | ||
- ***Workaround:*** Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+. | ||
</br></br> | ||
|
||
3. **GPU Operator unable to install amdgpu driver if existing driver is already installed** | ||
- ***Impact:*** Driver install fails if the in-box amdgpu driver is already installed | ||
- ***Affected Configurations:*** All configurations | ||
- ***Workaround:*** When installing the amdgpu driver using the GPU Operator, worker nodes should either have amdgpu blacklisted or have no pre-installed amdgpu driver. [Blacklist the in-box driver](https://dcgpu.docs.amd.com/projects/gpu-operator/en/release-v1.0.0/drivers/installation.html#blacklist-inbox-driver) so that it is not loaded, or remove the pre-installed driver. | ||
</br></br> | ||
|
||
4. **When the GPU Operator is used in SKIP driver install mode and the amdgpu module is removed while the device plugin is installed, the device plugin does not reflect the active GPUs available on the server** | ||
- ***Impact:*** Workload scheduling is affected, because workloads may be scheduled on nodes that do not have an active GPU. | ||
- ***Affected Configurations:*** All configurations | ||
- ***Workaround:*** Restart the deployed device plugin pod. | ||
</br></br> | ||
|
||
5. **Worker nodes whose kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed** | ||
- ***Impact:*** Node upgrade will not proceed automatically and requires manual intervention | ||
- ***Affected Configurations:*** All configurations | ||
- ***Workaround:*** Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off: | ||
|
||
```bash | ||
kubectl cordon <node-name> | ||
``` | ||
|
||
</br> | ||
|
||
6. **When GPU Operator is installed with Exporter enabled, upgrade of driver is blocked as exporter is actively using the amdgpu module** | ||
Check failure on line 38 in docs/knownlimitations.md (GitHub Actions / Documentation / Markdown): Ordered list item prefix
|
||
- ***Impact:*** Driver upgrade is blocked | ||
- ***Affected Configurations:*** All configurations | ||
- ***Workaround:*** Disable the Metrics Exporter on the specific node to allow the driver upgrade, as follows: | ||
|
||
1. Label all nodes with new label: | ||
|
||
```bash | ||
kubectl label nodes --all amd.com/device-metrics-exporter=true | ||
``` | ||
|
||
2. Patch DeviceConfig to include new selectors for metrics exporter: | ||
|
||
```bash | ||
kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' -p '{"spec":{"metricsExporter":{"selector":{"feature.node.kubernetes.io/amd-gpu":"true","amd.com/device-metrics-exporter":"true"}}}}' | ||
``` | ||
3. Remove the amd.com/device-metrics-exporter label for the specific node you would like to disable the exporter on: | ||
```bash | ||
kubectl label node [node-to-exclude] amd.com/device-metrics-exporter- | ||
``` | ||
</br> | ||
7. **Due to an issue with KMM 2.2, deletion of the DeviceConfig Custom Resource gets stuck in Red Hat OpenShift**
Check failure on line 63 in docs/knownlimitations.md (GitHub Actions / Documentation / Markdown): Ordered list item prefix
|
||
- ***Impact:*** Not able to delete the DeviceConfig Custom Resource if the node reboots during uninstall. | ||
- ***Affected Configurations:*** This issue only affects Red Hat OpenShift | ||
- ***Workaround:*** This issue will be fixed in the next release of KMM. For the time being you can use a previous version of KMM aside from 2.2 or manually remove the status from NMC: | ||
1. List all the NMC resources and pick the correct NMC (there is one NMC per node, named the same as the node it relates to).
```bash | ||
oc get nmc -A | ||
``` | ||
2. Edit the NMC. | ||
```bash | ||
oc edit nmc <nmc name> | ||
``` | ||
3. Remove all the data related to your module from the NMC status and save. That should allow the module to be deleted.