⭐ This project is actively maintained and contributions are welcome!
This repository provides a GitLab custom executor for Slurm, allowing you to run your GitLab CI/CD jobs directly on your own Slurm cluster.
Requirement: this executor does not support SSH access; the GitLab Runner must run on the same machine as the Slurm frontend.
Dependencies:
- Python 3.6+ (no external modules)
- GitLab Runner 12.1.0+
- UNIX environment
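As a quick sanity check on the Slurm frontend, you can confirm that the required tools are available (a minimal sketch; version output formats vary between installations):

```bash
# Check the dependencies on the machine that will run the executor
python3 --version        # expect 3.6 or newer
gitlab-runner --version  # expect 12.1.0 or newer
sbatch --version         # confirms the Slurm client tools are installed locally
squeue --version
```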
Clone this repository:
```bash
git clone https://github.com/Algebraic-Programming/slurm-gitlab-executor.git slurm-gitlab-executor
cd slurm-gitlab-executor
```
Register the runner by following the GitLab Runner registration instructions. This generates a config.toml from which you can produce the final configuration.
```bash
gitlab-runner register --name slurm-gitlab-executor --url https://mydomain.gitlab.com --token glrt-t0k3n
```
Use the generate-config.sh script to generate another config.toml:

```bash
./generate-config.sh /path/to/generated/config/dir/config.toml > /path/to/slurm-gitlab-executor/config.toml
```
You should end up with a configuration similar to this one:
```toml
check_interval = 0
shutdown_timeout = 0

# Number of concurrent Slurm jobs that can be running at the same time
concurrent = 10

[[runners]]
  executor = "custom"

  # Values generated by "gitlab-runner register"
  name = "my-slurm-gitlab-executor"          # This is the name of your runner
  url = "https://gitlab.my-domain.com/"      # This is the URL of your GitLab instance
  id = "11"                                  # This is the runner id
  token = "glrt-g1b3rr1sh"                   # This is the runner token
  token_obtained_at = "2023-01-01T00:00:00Z"
  token_expires_at = "2024-01-01T00:00:00Z"

  # Paths to the builds and cache directories
  # - must be absolute
  # - must be owned by the user running gitlab-runner
  # - must have enough space to store potentially large artifacts
  builds_dir = "/path/to/slurm-gitlab-executor/wd/builds"
  cache_dir = "/path/to/slurm-gitlab-executor/wd/cache"

  # Un-comment during debugging
  # log_level = "debug"

  # Paths to the driver/main.py executable controlling the Slurm jobs
  [runners.custom]
    config_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    config_args = ["config", "/path/to/slurm-gitlab-executor/wd/builds", "/path/to/slurm-gitlab-executor/wd/cache"]
    prepare_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    prepare_args = ["prepare"]
    run_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    run_args = ["run"]
    cleanup_exec = "/path/to/slurm-gitlab-executor/driver/main.py"
    cleanup_args = ["cleanup"]
```
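Optionally, you can check that the generated configuration is valid before starting the runner. These are standard gitlab-runner commands, not specific to this executor:

```bash
# Check that the registered runner token in config.toml is still valid
gitlab-runner verify --config /path/to/slurm-gitlab-executor/config.toml

# List the runners defined in this configuration file
gitlab-runner list --config /path/to/slurm-gitlab-executor/config.toml
```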
The runner can be executed in many ways:
- In a shell:

  ```bash
  ./gitlab-runner run --config /path/to/slurm-gitlab-executor/config.toml
  ```
- In a screen session:

  ```bash
  screen -dmS slurm-gitlab-executor ./gitlab-runner run --config /path/to/slurm-gitlab-executor/config.toml
  ```
- As a service (recommended; a status check follows this list):

  ```bash
  sudo cp /path/to/slurm-gitlab-executor/gitlab-runner.service /etc/systemd/system/
  sudo systemctl daemon-reload
  sudo systemctl enable gitlab-runner
  sudo systemctl start gitlab-runner
  ```
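Once the service is installed, you can check that it is running and follow its logs with standard systemd tooling:

```bash
# Confirm the runner service is active
sudo systemctl status gitlab-runner

# Follow the runner logs (useful together with log_level = "debug")
sudo journalctl -u gitlab-runner -f
```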
Example job configuration (a more detailed version is available here):
```yaml
default:
  tags:
    - slurm

variables:
  CI_SLURM_PARTITION:
    value: "x86"
    description: "Slurm partition to use"
    options: ["x86", "arm"]
  CI_SLURM_NNODES:
    value: "1"
    description: "Number of nodes"
  CI_SLURM_NTASKS:
    value: "1"
    description: "Number of tasks"
  CI_SLURM_CPUS_PER_TASK:
    value: "8"
    description: "Number of CPUs per task"
  CI_SLURM_MEM_PER_NODE:
    value: "32G"
    description: "Memory available per node"
  CI_SLURM_TIMELIMIT:
    value: "00-01:00:00"  # 0 days, 1 hour, 0 minutes, 0 seconds
    description: "Time limit (format: days-hours:minutes:seconds)"

# Simple job executed in a Slurm job with 8 CPUs
my_parallel_slurm_job:
  script:
    - echo "Hello from the Slurm cluster, in job ${SLURM_JOB_NAME}!"
    - touch output_from_parallel_job.txt
  artifacts:
    paths:
      - output_from_parallel_job.txt

# Simple job executed in a Slurm job with only 1 CPU
my_sequential_slurm_job:
  variables:
    # Any Slurm variable can be overridden using the "LOCAL_" prefix
    LOCAL_CI_SLURM_CPUS_PER_TASK: "1"
  script:
    - echo "Hello from the Slurm cluster, in job ${SLURM_JOB_NAME}!"
    - touch output_from_sequential_job.txt
  artifacts:
    paths:
      - output_from_sequential_job.txt

# Simple job executed in a Docker container, gathering the exported artifacts
# from the two Slurm jobs above
my_docker_job:
  needs: [my_parallel_slurm_job, my_sequential_slurm_job]
  tags:
    - docker
  image: ubuntu:latest
  script:
    - ls -l
    - echo "Successful execution"
```
Based on the Slurm sbatch documentation, with every `SBATCH_<VAR>` replaced by `CI_SLURM_<VAR>`.
| GitLab CI variable | Corresponding sbatch parameter | Supported | Options / Format | Default |
|---|---|---|---|---|
| `CI_SLURM_PARTITION` | `-p` / `--partition` | ✅ | | |
| `CI_SLURM_NNODES` | `-N` / `--nodes` | ✅ | `minnodes[-maxnodes]` | |
| `CI_SLURM_MEM_PER_NODE` | `--mem` | ✅ | | |
| `CI_SLURM_MEM_BIND` | `--mem-bind` | ✅ | | |
| `CI_SLURM_MEM_PER_CPU` | `--mem-per-cpu` | ✅ | | |
| `CI_SLURM_CPUS_PER_TASK` | `-c` / `--cpus-per-task` | ✅ | | |
| `CI_SLURM_NTASKS` | `--ntasks` | ✅ | | |
| `CI_SLURM_TIMELIMIT` | `-t` / `--time` | ✅ | `days-hours:minutes:seconds` | `00-02:00:00` |
| `CI_SLURM_TIME_MIN` | `--time-min` | ✅ | | |
| `CI_SLURM_EXCLUSIVE` | `--exclusive` | ✅ | `yes`, `no` | `no` |
| `CI_SLURM_NETWORK` | `--network` | ✅ | | |
| `CI_SLURM_CONTIGUOUS` | `--contiguous` | ✅ | `yes`, `no` | `no` |
| `CI_SLURM_POWER` | `--power` | ✅ | | |
| `CI_SLURM_PRIORITY` | `--priority` | ✅ | | |
| `CI_SLURM_NICE` | `--nice` | ✅ | | |
| `CI_SLURM_COMMENT` | `--comment` | ✅ | | `Automatic job created from GitLab CI` |
| `CI_SLURM_GPUS` | `-G` / `--gpus` | ⌛ | | |
| `CI_SLURM_GPUS_PER_NODE` | `--gpus-per-node` | ⌛ | | |
| `CI_SLURM_GPUS_PER_TASK` | `--gpus-per-task` | ⌛ | | |
| `CI_SLURM_CPUS_PER_GPU` | `--cpus-per-gpu` | ⌛ | | |
| Others... | | ❌ | | |
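For intuition, a job that keeps the defaults from the example CI configuration above corresponds roughly to the following sbatch submission. This is an illustrative sketch only, not the exact command the driver builds, and `job_script.sh` is a placeholder for the script the driver generates:

```bash
# Roughly what the executor asks Slurm for, given the example values above
sbatch \
  --partition=x86 \
  --nodes=1 \
  --ntasks=1 \
  --cpus-per-task=8 \
  --mem=32G \
  --time=00-01:00:00 \
  --comment="Automatic job created from GitLab CI" \
  job_script.sh  # placeholder for the generated job script
```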
| GitLab CI variable | Description | Supported | Options | Default |
|---|---|---|---|---|
| `CI_KEEP_BUILD_DIR` | If set to `yes`, the build folder on the Slurm cluster is not removed after a successful execution of the job | ✅ | `yes`, `no` | `no` |
| `CI_LOG_LEVEL_SLURM_EXECUTOR` | Log level of the executor | ✅ | `debug`, `info`, `none` | `info` |
| `SLURM_JOB_START_TIMEOUT_SECONDS` | Time limit (in seconds) to wait for a Slurm job to go from PENDING to RUNNING before considering it failed | ✅ | | `1200` |
| `SLURM_JOB_STOP_TIMEOUT_SECONDS_BEFORE_CANCEL` | Time (in seconds) to wait before cancelling a Slurm job that received a stop request | ✅ | | `30` |
Q: My CI job was killed on timeout because the Slurm job stayed PENDING for too long. What should I do?
A: Increase this GitLab setting: Settings > CI/CD > Runners > edit icon > Maximum job timeout. This value should cover the maximum time you expect your CI job to run plus the time needed to schedule it on the cluster.
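You can also check on the cluster why the job stays PENDING; the last column below shows the reason Slurm reports (standard squeue usage, not specific to this executor):

```bash
# Show your pending jobs and the reason Slurm has not started them yet
squeue -u "$USER" -t PENDING -o "%.18i %.9P %.30j %.8T %.10M %R"
```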
Q: My CI job was killed on timeout because the Slurm job stayed RUNNING for too long. What should I do?
A: Increase the Slurm variable CI_SLURM_TIMELIMIT (format: days-hours:minutes:seconds).
Q: Sometimes I cannot see the logs of my failed jobs. What should I do?
A: If the job fails for any reason before the logs are sent back to GitLab, you can still access them on the cluster. At the beginning of the job, the executor prints the path to the logs on the cluster; connect to the cluster and check the logs manually there.
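For example, assuming the builds_dir from the config.toml above, the failed job's working directory can be located like this (the exact directory layout and file names depend on your configuration and job):

```bash
# On the Slurm frontend, list the most recent build directories
ls -lt /path/to/slurm-gitlab-executor/wd/builds/ | head

# Inspect the contents of the failed job's directory (<job-build-dir> is a placeholder)
ls -la /path/to/slurm-gitlab-executor/wd/builds/<job-build-dir>/
```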
Q: I noticed that some Slurm jobs stayed active after the CI job finished. What should I do?
A: The maximum time a job can stay active without receiving any update is 10 minutes; past this time, it is supposed to stop by itself. If it does not, please create an issue on GitHub and provide as much information as you can.
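In the meantime, you can inspect and, if needed, cancel a leftover Slurm job manually with the standard Slurm tools:

```bash
# List your jobs that are still known to Slurm
squeue -u "$USER"

# Cancel a specific leftover job by its job id (<jobid> is a placeholder)
scancel <jobid>
```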
Copyright 2024 Huawei Technologies Co., Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.