⭐ This project is actively maintained and contributions are welcome!
This repository provides a GitLab custom executor for Slurm, allowing you to run your GitLab CI/CD jobs directly on your own Slurm cluster.
Requirement: this executor does not support SSH access; the GitLab Runner must run on the same machine as the Slurm frontend.
Dependencies:
- Python 3.6+ (no external modules)
- GitLab Runner 12.1.0+
- UNIX environment
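As a quick sanity check on the Slurm frontend, you can confirm that the required tools are available (a minimal sketch; version output formats vary between installations):

```bash
# Check the dependencies on the machine that will run the executor
python3 --version        # expect 3.6 or newer
gitlab-runner --version  # expect 12.1.0 or newer
sbatch --version         # confirms the Slurm client tools are installed locally
squeue --version
```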
Clone this repository:
```bash
git clone https://github.com/Algebraic-Programming/slurm-gitlab-executor.git slurm-gitlab-executor
cd slurm-gitlab-executor
```
Register the runner by following the GitLab Runner registration instructions. This generates a config.toml from which you can produce the final configuration.
```bash
gitlab-runner register --name slurm-gitlab-executor --url https://mydomain.gitlab.com --token glrt-t0k3n
```
Use the generate-config.sh script to generate another config.toml:

```bash
./generate-config.sh /path/to/generated/config/dir/config.toml > /path/to/slurm-gitlab-executor/config.toml
```
You should end up with a configuration similar to this one:
```toml
check_interval = 0
shutdown_timeout = 0

# Number of concurrent Slurm jobs that can be running at the same time
concurrent = 10

[[runners]]
  executor = "custom"

  # Values generated by "gitlab-runner register"
  name = "my-slurm-gitlab-executor"          # This is the name of your runner
  url = "https://gitlab.my-domain.com/"      # This is the URL of your GitLab instance
  id = "11"                                  # This is the runner id
  token = "glrt-g1b3rr1sh"                   # This is the runner token
  token_obtained_at = "2023-01-01T00:00:00Z"
  token_expires_at = "2024-01-01T00:00:00Z"

  # Paths to the builds and cache directories
  # - must be absolute
  # - must be owned by the user running gitlab-runner
  # - must have enough space to store potentially large artifacts
  builds_dir = "/path/to/slurm-gitlab-executor/wd/builds"
  cache_dir = "/path/to/slurm-gitlab-executor/wd/cache"

  # Un-comment during debugging
  # log_level = "debug"

  # Paths to the driver/main.py executable controlling the Slurm jobs
  [runners.custom]
    config_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    config_args = ["config", "/path/to/slurm-gitlab-executor/wd/builds", "/path/to/slurm-gitlab-executor/wd/cache"]
    prepare_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    prepare_args = ["prepare"]
    run_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    run_args = ["run"]
    cleanup_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    cleanup_args = ["cleanup"]
```
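Optionally, you can check that the generated configuration is valid before starting the runner. These are standard gitlab-runner commands, not specific to this executor:

```bash
# Check that the registered runner token in config.toml is still valid
gitlab-runner verify --config /path/to/slurm-gitlab-executor/config.toml

# List the runners defined in this configuration file
gitlab-runner list --config /path/to/slurm-gitlab-executor/config.toml
```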
The runner can be executed in many ways:
- In a shell:

  ```bash
  ./gitlab-runner run --config /path/to/slurm-gitlab-executor/config.toml
  ```
- In a screen session:

  ```bash
  screen -dmS slurm-gitlab-executor ./gitlab-runner run --config /path/to/slurm-gitlab-executor/config.toml
  ```
- As a service (recommended; a status check follows this list):

  ```bash
  sudo cp /path/to/slurm-gitlab-executor/gitlab-runner.service /etc/systemd/system/
  sudo systemctl daemon-reload
  sudo systemctl enable gitlab-runner
  sudo systemctl start gitlab-runner
  ```
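Once the service is installed, you can check that it is running and follow its logs with standard systemd tooling:

```bash
# Confirm the runner service is active
sudo systemctl status gitlab-runner

# Follow the runner logs (useful together with log_level = "debug")
sudo journalctl -u gitlab-runner -f
```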
Example job configuration (a more detailed version is available here):
```yaml
default:
  tags:
    - slurm

variables:
  CI_SLURM_PARTITION:
    value: "x86"
    description: "Slurm partition to use"
    options: ["x86", "arm"]
  CI_SLURM_NNODES:
    value: "1"
    description: "Number of nodes"
  CI_SLURM_NTASKS:
    value: "1"
    description: "Number of tasks"
  CI_SLURM_CPUS_PER_TASK:
    value: "8"
    description: "Number of CPUs per task"
  CI_SLURM_MEM_PER_NODE:
    value: "32G"
    description: "Memory available per node"
  CI_SLURM_TIMELIMIT:
    value: "00-01:00:00"  # 0 days, 1 hour, 0 minutes, 0 seconds
    description: "Time limit (format: days-hours:minutes:seconds)"

# Simple job executed in a Slurm job with 8 CPUs
my_parallel_slurm_job:
  script:
    - echo "Hello from the Slurm cluster, in job ${SLURM_JOB_NAME}!"
    - touch output_from_parallel_job.txt
  artifacts:
    paths:
      - output_from_parallel_job.txt

# Simple job executed in a Slurm job with only 1 CPU
my_sequential_slurm_job:
  variables:
    # Any Slurm variable can be overridden using the "LOCAL_" prefix
    LOCAL_CI_SLURM_CPUS_PER_TASK: "1"
  script:
    - echo "Hello from the Slurm cluster, in job ${SLURM_JOB_NAME}!"
    - touch output_from_sequential_job.txt
  artifacts:
    paths:
      - output_from_sequential_job.txt

# Simple job executed in a Docker container, gathering the exported artifacts
# from the two Slurm jobs above
my_docker_job:
  needs: [my_parallel_slurm_job, my_sequential_slurm_job]
  tags:
    - docker
  image: ubuntu:latest
  script:
    - ls -l
    - echo "Successful execution"
```
Based on the Slurm sbatch documentation, with every `SBATCH_<VAR>` replaced by `CI_SLURM_<VAR>`.
| GitLab CI variable | Corresponding sbatch parameter | Supported | Options / Format | Default |
|---|---|---|---|---|
| `CI_SLURM_PARTITION` | `-p` / `--partition` | ✅ | | |
| `CI_SLURM_NNODES` | `-N` / `--nodes` | ✅ | `minnodes[-maxnodes]` | |
| `CI_SLURM_MEM_PER_NODE` | `--mem` | ✅ | | |
| `CI_SLURM_MEM_BIND` | `--mem-bind` | ✅ | | |
| `CI_SLURM_MEM_PER_CPU` | `--mem-per-cpu` | ✅ | | |
| `CI_SLURM_CPUS_PER_TASK` | `-c` / `--cpus-per-task` | ✅ | | |
| `CI_SLURM_NTASKS` | `--ntasks` | ✅ | | |
| `CI_SLURM_TIMELIMIT` | `-t` / `--time` | ✅ | `days-hours:minutes:seconds` | `00-02:00:00` |
| `CI_SLURM_TIME_MIN` | `--time-min` | ✅ | | |
| `CI_SLURM_EXCLUSIVE` | `--exclusive` | ✅ | `yes`, `no` | `no` |
| `CI_SLURM_NETWORK` | `--network` | ✅ | | |
| `CI_SLURM_CONTIGUOUS` | `--contiguous` | ✅ | `yes`, `no` | `no` |
| `CI_SLURM_POWER` | `--power` | ✅ | | |
| `CI_SLURM_PRIORITY` | `--priority` | ✅ | | |
| `CI_SLURM_NICE` | `--nice` | ✅ | | |
| `CI_SLURM_COMMENT` | `--comment` | ✅ | | `Automatic job created from GitLab CI` |
| `CI_SLURM_GPUS` | `-G` / `--gpus` | ⌛ | | |
| `CI_SLURM_GPUS_PER_NODE` | `--gpus-per-node` | ⌛ | | |
| `CI_SLURM_GPUS_PER_TASK` | `--gpus-per-task` | ⌛ | | |
| `CI_SLURM_CPUS_PER_GPU` | `--cpus-per-gpu` | ⌛ | | |
| Others... | | ❌ | | |
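For intuition, a job that keeps the defaults from the example CI configuration above corresponds roughly to the following sbatch submission. This is an illustrative sketch only, not the exact command the driver builds, and `job_script.sh` is a placeholder for the script the driver generates:

```bash
# Roughly what the executor asks Slurm for, given the example values above
sbatch \
  --partition=x86 \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=8 \
  --mem=32G \
  --time=00-01:00:00 \
  --comment="Automatic job created from GitLab CI" \
  job_script.sh  # placeholder for the generated job script
```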
| GitLab CI variable | Description | Supported | Options | Default |
|---|---|---|---|---|
| `CI_KEEP_BUILD_DIR` | If set to `yes`, the build folder on the Slurm cluster is not removed after a successful execution of the job | ✅ | `yes`, `no` | `no` |
| `CI_LOG_LEVEL_SLURM_EXECUTOR` | Log level of the executor | ✅ | `debug`, `info`, `none` | `info` |
| `SLURM_JOB_START_TIMEOUT_SECONDS` | Time limit (in seconds) to wait for a Slurm job to go from PENDING to RUNNING before considering it failed | ✅ | | `1200` |
| `SLURM_JOB_STOP_TIMEOUT_SECONDS_BEFORE_CANCEL` | Time (in seconds) to wait before cancelling a Slurm job that received a stop request | ✅ | | `30` |
Q: My CI job was killed on timeout because the Slurm job stayed PENDING for too long. What should I do?
A: Increase this GitLab setting: Settings > CI/CD > Runners > edit icon > Maximum job timeout. This value should cover the maximum time you expect your CI job to run plus the time needed to schedule it on the cluster.
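You can also check on the cluster why the job stays PENDING; the last column below shows the reason Slurm reports (standard squeue usage, not specific to this executor):

```bash
# Show your pending jobs and the reason Slurm has not started them yet
squeue -u "$USER" -t PENDING -o "%.18i %.9P %.30j %.8T %.10M %R"
```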
Q: My CI job was killed on timeout because the Slurm job stayed RUNNING for too long. What should I do?
A: Increase the Slurm variable CI_SLURM_TIMELIMIT (format: days-hours:minutes:seconds).
Q: Sometimes I cannot see the logs of my failed jobs. What should I do?
A: If the job fails for any reason before the logs are sent back to GitLab, you can still access them on the cluster. At the beginning of the job, the executor prints the path to the logs on the cluster; connect to the cluster and check the logs manually there.
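For example, assuming the builds_dir from the config.toml above, the failed job's working directory can be located like this (the exact directory layout and file names depend on your configuration and job):

```bash
# On the Slurm frontend, list the most recent build directories
ls -lt /path/to/slurm-gitlab-executor/wd/builds/ | head

# Inspect the contents of the failed job's directory (<job-build-dir> is a placeholder)
ls -la /path/to/slurm-gitlab-executor/wd/builds/<job-build-dir>/
```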
Q: I noticed that some Slurm jobs stayed active after the CI job finished. What should I do?
A: The maximum time a job can stay active without receiving any update is 10 minutes; past this time, it is supposed to stop by itself. If it does not, please create an issue on GitHub and provide as much information as you can.
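In the meantime, you can inspect and, if needed, cancel a leftover Slurm job manually with the standard Slurm tools:

```bash
# List your jobs that are still known to Slurm
squeue -u "$USER"

# Cancel a specific leftover job by its job id (<jobid> is a placeholder)
scancel <jobid>
```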
Copyright 2024 Huawei Technologies Co., Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.