Device Metrics Exporter exports metrics from AMD GPUs to collectors like Prometheus.
- Ubuntu 22.04
- ROCm 6.2
- Kubernetes cluster is up and running
- Helm tool is installed on the node with kubectl + kube config file to get access to the cluster
- Prepare
values.yaml
to setup the deployment parameters, for example:platform: k8s nodeSelector: {} # add customized nodeSelector for metrics exporter daemonset image: repository: docker.io/rocm/device-metrics-exporter tag: v1.0.0 pullPolicy: Always pullSecrets: "" # put name of docker-registry secret here if needed for exporter image service: type: NodePort # select NodePort or ClusterIP as metrics exporter's service type ClusterIP: port: 5000 # cluster internal service port NodePort: port: 5000 # cluster internal service port nodePort: 32500 # external node port configMap: "" # put name of configmap here if needed for customizing exported stats
- (Optional) if you want to customize the exported stats, please create a configmap by using
example/configmap.yaml
(please modify the namespace to align with helm install command), and put the configmap name intovalues.yaml
.
-
Run
helm install
command to deploy exporter in your Kubernetes cluster:helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.0.0/device-metrics-exporter-charts-v1.0.0.tgz -n mynamespace -f values.yaml
-
Option 1: you can directly modify the Kubernetes resource to modify the config, including modifying configmap, service, rbac or daemonset resources.
-
Option 2: you can prepare the updated
values.yaml
and do a helm chart upgrade:helm upgrade exporter -n mynamespace -f updated_values.yaml
-
Uninstall the helm charts by running:
helm uninstall exporter -n mynamespace
- Ubuntu 22.04/24.04
- ROCm 6.2
- Docker (podman or another alternative can be used, but commands provided are for Docker)
-
Clone rocm/device-metrics-exporter repo and cd into folder
git clone https://github.com/rocm/device-metrics-exporter cd device-metrics-exporter
-
Run metrics-exporter standalone docker container
docker run -itd --device=/dev/dri --device=/dev/kfd -v ./config:/etc/metrics -p 5000:5000 --name exporter rocm/device-metrics-exporter:v1.0.0
-
Update prometheus.yml file to replace localhost with host.docker.internal
sed -i 's/localhost:5000/host.docker.internal:5000/g' example/prometheus.yml
- Run Prometheus as standalone docker container
docker run -itd -p 9090:9090 -v ./example/prometheus.yml:/etc/prometheus/prometheus.yml -v prometheus-data:/prometheus --add-host host.docker.internal:host-gateway --name prometheus prom/prometheus
- Prometheus should now be accessable on http://localhost:9090
- Check the Status > Target health page to ensure the metrics exporter target is Up
- Run Grafana as standalone docker container
docker run --rm -d --name grafana -p 3000:3000 --add-host host.docker.internal:host-gateway grafana/grafana:latest
- Grafana should now be accessable on http://localhost:3000. Username and password to login is
admin/admin
Note: Be sure to use http://host.docker.internal:9090 when adding Prometheus as a Data connection in the Grafana UI
- Run the below commands to install Grafana on your system. See Grafana docs for more details
sudo apt-get install -y apt-transport-https software-properties-common wget sudo mkdir -p /etc/apt/keyrings/ wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt sources.list.d/grafana.list sudo apt-get update sudo apt-get install grafana
- Run the Grafana server
sudo systemctl daemon-reload sudo systemctl start grafana-server sudo systemctl status grafana-server