recipe for setting up slurm for exporter #16

87 changes: 87 additions & 0 deletions slurm/README.md
@@ -0,0 +1,87 @@
# Slurm exporter setup

The exporter requires its epilog and prolog scripts to be placed in the
corresponding Slurm epilog and prolog directories.

## Host/Node Setup

Perform these steps on every node in the cluster where the exporter is to be deployed.

Below is a sample `slurm.conf` entry that points to the epilog and prolog directories.

```
# Epilogs and Prologs
Epilog="/etc/slurm/epilog.d/*"
Prolog="/etc/slurm/prolog.d/*"
```

Copy each script into its matching directory:
```
cp exporter-epilog.sh /etc/slurm/epilog.d/exporter-epilog.sh
cp exporter-prolog.sh /etc/slurm/prolog.d/exporter-prolog.sh
```

Once the configuration is complete, restart slurmd:

```
systemctl restart slurmd.service
```
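
The scripts must be executable for slurmd to run them, so it can help to set permissions explicitly rather than rely on a plain `cp`. A minimal sketch, assuming the directories from the `slurm.conf` sample above; the `install_hooks` function and its optional root-prefix argument are hypothetical and exist only for illustration:

```shell
#!/bin/bash
# Hypothetical helper: copy both exporter hooks into place with mode 0755.
# $1 is an optional root prefix (empty for a real install), which makes it
# possible to try the layout in a scratch directory first.
install_hooks() {
    local root="${1:-}"
    local kind
    for kind in epilog prolog; do
        mkdir -p "${root}/etc/slurm/${kind}.d"
        cp "exporter-${kind}.sh" "${root}/etc/slurm/${kind}.d/"
        chmod 0755 "${root}/etc/slurm/${kind}.d/exporter-${kind}.sh"
    done
}

# Usage, from the directory that holds the two scripts:
#   install_hooks            # installs under /etc/slurm
#   install_hooks /tmp/try   # dry run under /tmp/try/etc/slurm
```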

## Exporter Container Deployment

### directory setup

1. It is recommended to create the following hierarchy to keep the exporter data in persistent files on the host:

```
$ tree exporter/
exporter/
└── config
    └── config.json
```

2. The `/var/run/exporter` directory must be created on the host; the prolog and epilog scripts use it to track Slurm jobs.

```
mkdir -p /var/run/exporter
```
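
The tracking is file based: the prolog writes one JSON file per allocated GPU into this directory, named after the GPU's entry in `CUDA_VISIBLE_DEVICES`, and the epilog removes those files when the job ends. A minimal sketch of that lifecycle, meant to be run against a scratch directory; the function names `track_job` and `untrack_job` are hypothetical, and the JSON is reduced to a single field for brevity:

```shell
#!/bin/bash
# Hypothetical helpers mirroring what the prolog and epilog scripts do.
# $1: tracking directory, $2: comma-separated GPU ids, $3: job id (track only)
track_job() {
    local dir="$1" gpus="$2" job_id="$3" gpu
    # Like the real scripts, do nothing if the directory is missing.
    [ -d "${dir}" ] || return 0
    for gpu in ${gpus//,/ }; do
        printf '{"SLURM_JOB_ID": "%s"}\n' "${job_id}" > "${dir}/${gpu}"
    done
}

untrack_job() {
    local dir="$1" gpus="$2" gpu
    [ -d "${dir}" ] || return 0
    for gpu in ${gpus//,/ }; do
        rm -f "${dir}/${gpu}"
    done
}

# Usage:
#   scratch=$(mktemp -d)
#   track_job "$scratch" "0,1" 1234 && ls "$scratch"
#   untrack_job "$scratch" "0,1"
```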

### label configuration
By default, the Slurm job labels are not enabled in the exporter. To enable them, use the content below for the `config.json` file.

```
{
  "GPUConfig": {
    "Labels": [
      "GPU_UUID",
      "SERIAL_NUMBER",
      "GPU_ID",
      "JOB_ID",
      "JOB_USER",
      "JOB_PARTITION",
      "CLUSTER_NAME",
      "CARD_SERIES",
      "CARD_MODEL",
      "CARD_VENDOR",
      "DRIVER_VERSION",
      "VBIOS_VERSION",
      "HOSTNAME"
    ]
  }
}
```
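
Before starting the container, it can be worth confirming that the job-related labels actually made it into the file. A minimal sketch using only `grep`; the `check_slurm_labels` function is hypothetical, and treating `JOB_ID`, `JOB_USER`, `JOB_PARTITION` and `CLUSTER_NAME` as the Slurm-derived labels is an assumption based on the variable names used in the prolog script:

```shell
#!/bin/bash
# Hypothetical check: succeed only if every job-related label from the
# sample above is listed in the given config file.
# $1: path to config.json
check_slurm_labels() {
    local config="$1" label
    for label in JOB_ID JOB_USER JOB_PARTITION CLUSTER_NAME; do
        # Match the quoted label so that e.g. "GPU_ID" does not satisfy "ID".
        if ! grep -q "\"${label}\"" "${config}"; then
            echo "missing label: ${label}" >&2
            return 1
        fi
    done
}

# Usage:
#   check_slurm_labels exporter/config/config.json && echo "labels ok"
```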

### start exporter

Once all of the above steps are done, start the exporter from the `exporter/` directory so that `./config` resolves to the directory created earlier:

```
docker run -d \
--device=/dev/dri \
--device=/dev/kfd \
-v ./config:/etc/metrics \
-v /var/run/exporter/:/var/run/exporter/ \
-p 5000:5000 --name exporter \
rocm/device-metrics-exporter:v1.0.0
```
37 changes: 37 additions & 0 deletions slurm/exporter-epilog.sh
@@ -0,0 +1,37 @@
#!/bin/bash

#
#Copyright (c) Advanced Micro Devices, Inc. All rights reserved.

#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
#

EXPORT_DIR="/var/run/exporter/"
# Job metadata as JSON. The epilog never writes MSG; it only removes files.
MSG=$(
cat <<EOF
{
"SLURM_JOB_ID": "${SLURM_JOB_ID}",
"SLURM_JOB_USER": "${SLURM_JOB_USER}",
"SLURM_JOB_PARTITION": "${SLURM_JOB_PARTITION}",
"SLURM_CLUSTER_NAME": "${SLURM_CLUSTER_NAME}",
"SLURM_JOB_GPUS": "${SLURM_JOB_GPUS}",
"CUDA_VISIBLE_DEVICES": "${CUDA_VISIBLE_DEVICES}",
"SLURM_SCRIPT_CONTEXT": "${SLURM_SCRIPT_CONTEXT}"
}
EOF
)
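# Nothing to clean up if the tracking directory does not exist.
[ -d "${EXPORT_DIR}" ] || exit 0
# Remove the tracking file of every GPU that was allocated to this job.
GPUS=$(echo "${CUDA_VISIBLE_DEVICES}" | tr "," "\n")
for GPUID in ${GPUS}; do
    rm -f "${EXPORT_DIR}/${GPUID}"
done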
37 changes: 37 additions & 0 deletions slurm/exporter-prolog.sh
@@ -0,0 +1,37 @@
#!/bin/bash

#
#Copyright (c) Advanced Micro Devices, Inc. All rights reserved.

#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
#

EXPORT_DIR="/var/run/exporter/"
# Job metadata as JSON, written to one tracking file per allocated GPU.
MSG=$(
cat <<EOF
{
"SLURM_JOB_ID": "${SLURM_JOB_ID}",
"SLURM_JOB_USER": "${SLURM_JOB_USER}",
"SLURM_JOB_PARTITION": "${SLURM_JOB_PARTITION}",
"SLURM_CLUSTER_NAME": "${SLURM_CLUSTER_NAME}",
"SLURM_JOB_GPUS": "${SLURM_JOB_GPUS}",
"CUDA_VISIBLE_DEVICES": "${CUDA_VISIBLE_DEVICES}",
"SLURM_SCRIPT_CONTEXT": "${SLURM_SCRIPT_CONTEXT}"
}
EOF
)
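# Skip tracking if the directory has not been created on this node.
[ -d "${EXPORT_DIR}" ] || exit 0
# Write the metadata to one file per GPU, named after its CUDA_VISIBLE_DEVICES entry.
GPUS=$(echo "${CUDA_VISIBLE_DEVICES}" | tr "," "\n")
for GPUID in ${GPUS}; do
    echo "${MSG}" >"${EXPORT_DIR}/${GPUID}"
done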