Reduce the default jobs/checks per second
Don't want to stress Slurm unnecessarily since it will impact all HPC
users
jdblischak committed Sep 17, 2021
1 parent 4f4d80e commit 76b1ed2
Showing 2 changed files with 25 additions and 3 deletions.
24 changes: 23 additions & 1 deletion README.md
@@ -6,6 +6,7 @@
* [Limitations](#limitations)
* [Quick start](#quick-start)
* [Customizations](#customizations)
* [Use speed with caution](#use-speed-with-caution)
* [License](#license)

The option [`--cluster-config`][cluster-config] is deprecated, but it's still
@@ -32,7 +33,8 @@ post][sichong-post] by Sichong Peng nicely explains this strategy for replacing

* Fast! It can quickly submit jobs and check their status because it doesn't
invoke a Python script for these steps, which adds up when you have thousands
- of jobs
+ of jobs (however, please see the section [Use speed with
+ caution](#use-speed-with-caution))

* No reliance on the deprecated option `--cluster-config` to customize job
resources
@@ -282,6 +284,26 @@ documentation below.
latest attempt. Also, please upvote my [PR][pr-multi-cluster] to fix this in
Snakemake.

## Use speed with caution

A big benefit of this profile's simplicity is the speed with which jobs can
be submitted and their statuses checked. The [official Slurm profile for
Snakemake][slurm-official] provides much more fine-grained control, but that
logic lives in Python scripts that must be invoked for every job submission
and status check. I needed this speed for a pipeline with an aggregation rule
that had to run tens of thousands of times, where each job finished in under
10 seconds. In that situation, the job submission and status check rates were
huge bottlenecks.
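
For a rough sense of scale (the job count here is purely illustrative):
submitting 30,000 jobs at 100 jobs per second takes about 5 minutes of
scheduler requests, whereas at 10 jobs per second the same submissions take
about 50 minutes.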

However, use this speed with caution! On a shared HPC cluster, many users
send requests to the Slurm scheduler, and if too many requests arrive at
once, performance suffers for all users. If the rules in your Snakemake
pipeline take more than a few minutes to complete, it's overkill to check job
statuses multiple times per second. In other words, only increase
`max-jobs-per-second` and/or `max-status-checks-per-second` if the submission
rate or the status checks that confirm job completion are a clear bottleneck.
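
If profiling does show that submission or polling is the limiting factor, the
rates can be raised in the profile's `config.yaml`. Below is a minimal
sketch; the exact values are illustrative, not recommendations:

```yaml
# Raise these only if job submission or status polling is a measured
# bottleneck; every extra request adds load on the shared scheduler.
max-jobs-per-second: 50          # illustrative value
max-status-checks-per-second: 5  # illustrative value
```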

## License

This is all boilerplate code. Please feel free to use it for whatever purpose
4 changes: 2 additions & 2 deletions simple/config.yaml
@@ -12,8 +12,8 @@ default-resources:
- qos=<name-of-quality-of-service>
- mem_mb=1000
restart-times: 3
-max-jobs-per-second: 100
-max-status-checks-per-second: 10
+max-jobs-per-second: 10
+max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 500
