Running Sprocket on a Slurm cluster
Sprocket is a workflow execution engine for the Workflow Description Language (WDL). It can dispatch individual task executions to an HPC cluster using Slurm for job scheduling and Apptainer as the container runtime, allowing you to run WDL workflows at scale on existing HPC infrastructure.
This guide is intended for system administrators and power users looking to configure Sprocket for their HPC environment. It walks through the entire process of getting Sprocket running on a Slurm cluster: installing the binary, configuring the backend, running your first workflow, and tuning for production use. By the end, you will have Sprocket submitting containerized WDL tasks as Slurm jobs.
WARNING
The Slurm + Apptainer backend is experimental, and its behavior and configuration may change between Sprocket releases.
Prerequisites
Before starting, verify that your environment meets the following requirements.
Slurm command-line tools (25.05.0 or later) must be available on the login/submission node:
```shell
sbatch --version
squeue --version
scancel --version
```

Apptainer (1.3.6 or later) must be installed on the compute nodes where Slurm dispatches jobs:

```shell
apptainer --version
```

A shared filesystem (e.g., Lustre, GPFS, or NFS) must be accessible from both the login node and the compute nodes. Sprocket writes its output directory to this filesystem, and the compute nodes read task scripts and write results back to it.
Network access from compute nodes to container registries (Docker Hub, Quay, or your organization's private registry) is required for pulling container images. If your compute nodes lack outbound internet access, you will need to pre-pull images or configure a local registry mirror.
Installing Sprocket
The simplest approach on an HPC cluster is to download a pre-built binary from the GitHub releases page and place it on the shared filesystem.
```shell
# Determine the latest version
VERSION=$(curl -s https://api.github.com/repos/stjude-rust-labs/sprocket/releases/latest | grep '"tag_name"' | cut -d '"' -f 4)

# Download the release (adjust the platform as needed)
curl -L -o sprocket.tar.gz "https://github.com/stjude-rust-labs/sprocket/releases/download/${VERSION}/sprocket-${VERSION}-x86_64-unknown-linux-gnu.tar.gz"

# Extract the binary
tar xzf sprocket.tar.gz

# Move it somewhere on the shared filesystem
mv sprocket /shared/software/bin/sprocket
```

Verify the installation:

```shell
sprocket --version
```

Alternatively, if a Rust toolchain is available, you can install from source:

```shell
cargo install sprocket --locked
```

TIP
If your site uses environment modules, consider creating a module file for Sprocket so users can load it with module load sprocket. Spack is another common option for managing software on HPC clusters.
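For example, a minimal Lmod-style modulefile might look like the sketch below; the module path, install prefix, and configuration location are assumptions to adapt to your site layout (the `SPROCKET_CONFIG` variable is covered in the next section):

```lua
-- Hypothetical Lmod modulefile, e.g. /shared/modulefiles/sprocket/<version>.lua
whatis("Sprocket: an execution engine for WDL workflows")

-- Put the shared sprocket binary on PATH.
prepend_path("PATH", "/shared/software/bin")

-- Point users at the site-wide configuration (see the next section).
setenv("SPROCKET_CONFIG", "/shared/config/sprocket.toml")
```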
Setting up a shared configuration
For a multi-user cluster, it is common to provide a site-wide sprocket.toml that all users inherit. There are two recommended approaches:
- **Executable-adjacent configuration.** Place a `sprocket.toml` in the same directory as the `sprocket` binary (e.g., `/shared/software/bin/sprocket.toml`). Sprocket automatically loads this file, making it a natural fit when the binary is installed to a shared location.

- **Environment variable.** Set `SPROCKET_CONFIG` to point to a shared configuration file. This is useful when the binary is managed separately (e.g., installed via `cargo install`) or when you need different configurations for different groups of users:

  ```shell
  export SPROCKET_CONFIG=/shared/config/sprocket.toml
  ```

  If your site uses environment modules, add this export to the module file so it is set automatically when users run `module load sprocket`.
Users can still override settings by placing their own sprocket.toml in their working directory or by passing --config on the command line. See the configuration overview for the full load order and precedence rules.
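For instance, a user who wants fail-fast behavior (described later under Failure mode) without touching the site-wide file could keep a small `sprocket.toml` like this sketch in their working directory:

```toml
# Per-user override in the working directory; takes precedence over the shared configuration.
[run]
fail = "fast"
```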
Configuring the backend
The following example configures Sprocket to use Slurm + Apptainer as its default backend. This is a good starting point for a shared sprocket.toml:
```toml
# Enable experimental features (required for the Slurm backend).
[run]
experimental_features_enabled = true
# Use the Slurm + Apptainer backend.
[run.backends.default]
type = "slurm_apptainer"
# Default partition for task execution.
#
# If omitted, jobs are submitted to your cluster's default partition.
default_slurm_partition.name = "compute"
default_slurm_partition.max_cpu_per_task = 64
default_slurm_partition.max_memory_per_task = "96 GB"
# Optional: dedicated partition for short tasks.
# short_task_slurm_partition.name = "short"
# Optional: dedicated partition for GPU tasks.
# gpu_slurm_partition.name = "gpu"
# Optional: dedicated partition for FPGA tasks.
# fpga_slurm_partition.name = "fpga"
# Additional arguments passed to `sbatch` when submitting jobs.
# For example, set a default time limit for all jobs.
# extra_sbatch_args = ["--time=60"]
# Additional arguments passed to `apptainer exec`.
# For example, pass `--nv` to enable GPU support inside containers.
# extra_apptainer_exec_args = ["--nv"]
```

Resource limit behavior
Each partition can declare the largest CPU and memory allocation it supports:
```toml
[run.backends.default]
default_slurm_partition.max_cpu_per_task = 64
default_slurm_partition.max_memory_per_task = "96 GB"
```

When a WDL task requests more than these limits, Sprocket's behavior is controlled by two settings:
```toml
[run.task]
cpu_limit_behavior = "deny"
memory_limit_behavior = "deny"
```

- `"deny"` (default) — Sprocket refuses to run the task and reports an error.
- `"try_with_max"` — Sprocket clamps the request to the partition's maximum and attempts to run the task anyway. This does not guarantee success, but it avoids failing before the task has a chance to run.
If max_cpu_per_task and max_memory_per_task are not set on a partition, these settings have no effect and Sprocket submits the task's resource request as-is.
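For example, a site that would rather clamp oversized requests than reject them can pair the partition limits with `"try_with_max"`; a sketch reusing the values from the backend example above:

```toml
[run.backends.default]
type = "slurm_apptainer"
default_slurm_partition.name = "compute"
default_slurm_partition.max_cpu_per_task = 64
default_slurm_partition.max_memory_per_task = "96 GB"

[run.task]
# Requests above 64 CPUs or 96 GB are clamped to the partition maximums instead of being rejected.
cpu_limit_behavior = "try_with_max"
memory_limit_behavior = "try_with_max"
```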
Running your first workflow
Create a file called hello.wdl:
```wdl
version 1.3

task say_hello {
    input {
        String greeting
    }

    command <<<
        echo "~{greeting}, world!"
    >>>

    output {
        String message = read_string(stdout())
    }

    requirements {
        container: "ubuntu:latest"
    }
}
```

Run it:
```shell
sprocket run hello.wdl --target say_hello greeting="Hello"
```

Sprocket will pull the `ubuntu:latest` container image (converting it to an Apptainer SIF file), submit a job via `sbatch`, wait for the job to complete, and collect the outputs. You can monitor the Slurm job while it runs:

```shell
squeue -u $USER
```

Once the run completes, the output directory will contain:
```
out/
├── sprocket.db
└── runs/
    └── say_hello/
        └── <timestamp>/
            ├── output.log
            ├── inputs.json
            ├── outputs.json
            └── attempts/
                └── 0/
                    ├── command
                    ├── stdout
                    ├── stderr
                    └── work/
```

Production considerations
Output directory placement
Place the output directory on the shared filesystem so that both the login node and compute nodes can access it. Use the -o flag to specify the location:
```shell
sprocket run workflow.wdl --target main -o /shared/results/my-project
```

Scatter concurrency
The run.workflow.scatter.concurrency setting controls how many elements within a scatter block are evaluated concurrently. The default is 1000:
```toml
run.workflow.scatter.concurrency = 1000
```

Setting this too high can put pressure on the scheduler by queueing a large number of jobs at once. Note that each scattered task may itself request multiple CPUs, so the actual resource consumption can be a multiple of this number.
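If the queue is being flooded, lower the value in the shared configuration; the number below is only an illustrative starting point:

```toml
# Evaluate at most 200 scatter elements at a time (example value; tune for your scheduler).
run.workflow.scatter.concurrency = 200
```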
Container image caching
Sprocket pulls container images and converts them to SIF files in an apptainer-images/ directory inside each run's timestamped directory. A given image is only pulled once within a run, but each new run pulls all of its images fresh. For workflows that use many large images, you can pre-pull images to a shared location using apptainer pull and reference the local SIF path in your WDL container declarations:
```shell
apptainer pull /shared/containers/ubuntu_latest.sif docker://ubuntu:latest
```
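A WDL task can then reference the pre-pulled SIF path directly in its container declaration; a sketch (the task name is illustrative):

```wdl
version 1.3

task say_hello_prepulled {
    command <<<
        echo "Hello from a pre-pulled image"
    >>>

    output {
        String message = read_string(stdout())
    }

    requirements {
        # Reference the SIF on the shared filesystem instead of pulling from a registry.
        container: "/shared/containers/ubuntu_latest.sif"
    }
}
```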
Provenance database

The `sprocket.db` SQLite database lives in the output directory and records every run, task, and session. SQLite works over shared filesystems, but for heavily concurrent workloads it is wise to back up the database regularly. See the provenance tracking documentation for details.
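As a sketch, a periodic backup could use SQLite's online backup command, assuming the `sqlite3` CLI is available and the output directory from the earlier example (the backup destination is hypothetical):

```shell
# Take a consistent snapshot of the provenance database, even if Sprocket is still writing to it.
sqlite3 /shared/results/my-project/sprocket.db ".backup /shared/backups/sprocket-$(date +%F).db"
```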
Failure mode
The run.fail setting controls what happens when a task fails:
"slow"(default) — Sprocket waits for all currently running tasks to complete before exiting."fast"— Sprocket cancels all running tasks immediately viascanceland exits.
```toml
[run]
fail = "fast"
```

Monitoring and troubleshooting
Checking job status
While a workflow is running, you can use squeue to see the Slurm jobs that Sprocket has submitted:
```shell
squeue -u $USER
```

Inspecting run output
Each task attempt writes its files to out/runs/<target>/<timestamp>/attempts/<n>/:
| File | Contents |
|---|---|
| `command` | The shell script that was executed inside the container |
| `stdout` | Standard output from the task |
| `stderr` | Standard error from the task |
| `work/` | The task's working directory, containing any output files |
When troubleshooting a failed task, start with stderr and command to understand what ran and what went wrong. See the provenance tracking documentation for a full description of the run directory structure.
Common issues
- **Job pending indefinitely.** The task's resource request (CPU or memory) may exceed what the partition can provide. Check with `scontrol show job <jobid>` and compare against your partition limits. Consider setting `cpu_limit_behavior = "try_with_max"` or adjusting the partition's `max_cpu_per_task` and `max_memory_per_task`.
- **Container pull failure.** Compute nodes may lack network access to the container registry. Check whether the nodes can reach the registry, or pre-pull the image and reference a local path.
- **Permission errors.** The shared filesystem must be writable by the user running Sprocket and by the Slurm jobs on the compute nodes. Verify that directory permissions are consistent across nodes.
- **Apptainer not found on compute nodes.** Ensure Apptainer is installed and on the `PATH` for Slurm jobs. If Apptainer is provided via an environment module, it must be loaded in the user's environment before running `sprocket run` so that the job inherits the correct `PATH`; see the sketch below.
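For example, assuming the module is simply named `apptainer` (module names vary by site):

```shell
# Load Apptainer into the environment first so submitted Slurm jobs inherit the correct PATH.
module load apptainer
sprocket run workflow.wdl --target main -o /shared/results/my-project
```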
Getting help
If you run into problems or have feedback, join the OpenWDL Slack and reach out in the #sprocket channel.

