Running Sprocket on a Slurm cluster

Sprocket is a workflow execution engine for the Workflow Description Language (WDL). It can dispatch individual task executions to an HPC cluster using Slurm for job scheduling and Apptainer as the container runtime, allowing you to run WDL workflows at scale on existing HPC infrastructure.

This guide is intended for system administrators and power users looking to configure Sprocket for their HPC environment. It walks through the entire process of getting Sprocket running on a Slurm cluster: installing the binary, configuring the backend, running your first workflow, and tuning for production use. By the end, you will have Sprocket submitting containerized WDL tasks as Slurm jobs.

WARNING

The Slurm + Apptainer backend is experimental, and its behavior and configuration may change between Sprocket releases.

Prerequisites

Before starting, verify that your environment meets the following requirements.

  • Slurm command-line tools (25.05.0 or later) must be available on the login/submission node:

    shell
    sbatch --version
    squeue --version
    scancel --version
  • Apptainer (1.3.6 or later) must be installed on the compute nodes where Slurm dispatches jobs:

    shell
    apptainer --version
  • A shared filesystem (e.g., Lustre, GPFS, or NFS) must be accessible from both the login node and the compute nodes. Sprocket writes its output directory to this filesystem, and the compute nodes read task scripts and write results back to it.

  • Network access from compute nodes to container registries (Docker Hub, Quay, or your organization's private registry) is required for pulling container images. If your compute nodes lack outbound internet access, you will need to pre-pull images or configure a local registry mirror.
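
Both compute-node requirements can be checked from the login node by running the commands through Slurm itself. A minimal sketch, assuming a partition named compute and a throwaway path under /tmp (adjust both for your site):

shell
# Is Apptainer on the PATH of a compute node?
srun --partition=compute apptainer --version

# Can a compute node pull from a public registry?
srun --partition=compute apptainer pull /tmp/alpine_test.sif docker://alpine:latest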

Installing Sprocket

The simplest approach on an HPC cluster is to download a pre-built binary from the GitHub releases page and place it on the shared filesystem.

shell
# Determine the latest version
VERSION=$(curl -s https://api.github.com/repos/stjude-rust-labs/sprocket/releases/latest | grep '"tag_name"' | cut -d '"' -f 4)

# Download the release (adjust the platform as needed)
curl -L -o sprocket.tar.gz "https://github.com/stjude-rust-labs/sprocket/releases/download/${VERSION}/sprocket-${VERSION}-x86_64-unknown-linux-gnu.tar.gz"

# Extract the binary
tar xzf sprocket.tar.gz

# Move it somewhere on the shared filesystem
mv sprocket /shared/software/bin/sprocket

Verify the installation:

shell
sprocket --version

Alternatively, if a Rust toolchain is available, you can install from source:

shell
cargo install sprocket --locked

TIP

If your site uses environment modules, consider creating a module file for Sprocket so users can load it with module load sprocket. Spack is another common option for managing software on HPC clusters.

Setting up a shared configuration

For a multi-user cluster, it is common to provide a site-wide sprocket.toml that all users inherit. There are two recommended approaches:

  • Executable-adjacent configuration. Place a sprocket.toml in the same directory as the sprocket binary (e.g., /shared/software/bin/sprocket.toml). Sprocket automatically loads this file, making it a natural fit when the binary is installed to a shared location.

  • Environment variable. Set SPROCKET_CONFIG to point to a shared configuration file. This is useful when the binary is managed separately (e.g., installed via cargo install) or when you need different configurations for different groups of users:

    shell
    export SPROCKET_CONFIG=/shared/config/sprocket.toml

    If your site uses environment modules, add this export to the module file so it is set automatically when users run module load sprocket.

Users can still override settings by placing their own sprocket.toml in their working directory or by passing --config on the command line. See the configuration overview for the full load order and precedence rules.
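
If you take the module-file route suggested above, a single module can both put the binary on PATH and set SPROCKET_CONFIG. A minimal sketch using the classic Tcl modulefile format; the modulefile location is a placeholder for your site's module tree:

shell
# Write a minimal modulefile (location is hypothetical; adjust to your module tree)
cat > /shared/modulefiles/sprocket/1.0 <<'EOF'
#%Module1.0
prepend-path PATH /shared/software/bin
setenv SPROCKET_CONFIG /shared/config/sprocket.toml
EOF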

Configuring the backend

The following example configures Sprocket to use Slurm + Apptainer as its default backend. This is a good starting point for a shared sprocket.toml:

toml
# Enable experimental features (required for the Slurm backend).
[run]
experimental_features_enabled = true

# Use the Slurm + Apptainer backend.
[run.backends.default]
type = "slurm_apptainer"

# Default partition for task execution.
#
# If omitted, jobs are submitted to your cluster's default partition.
default_slurm_partition.name = "compute"
default_slurm_partition.max_cpu_per_task = 64
default_slurm_partition.max_memory_per_task = "96 GB"

# Optional: dedicated partition for short tasks.
# short_task_slurm_partition.name = "short"

# Optional: dedicated partition for GPU tasks.
# gpu_slurm_partition.name = "gpu"

# Optional: dedicated partition for FPGA tasks.
# fpga_slurm_partition.name = "fpga"

# Additional arguments passed to `sbatch` when submitting jobs.
# For example, set a default time limit for all jobs.
# extra_sbatch_args = ["--time=60"]

# Additional arguments passed to `apptainer exec`.
# For example, pass `--nv` to enable GPU support inside containers.
# extra_apptainer_exec_args = ["--nv"]

Resource limit behavior

Each partition can declare the largest CPU and memory allocation it supports:

toml
[run.backends.default]
default_slurm_partition.max_cpu_per_task = 64
default_slurm_partition.max_memory_per_task = "96 GB"

When a WDL task requests more than these limits, Sprocket's behavior is controlled by two settings:

toml
[run.task]
cpu_limit_behavior = "deny"
memory_limit_behavior = "deny"
  • "deny" (default) — Sprocket refuses to run the task and reports an error.
  • "try_with_max" — Sprocket clamps the request to the partition's maximum and attempts to run the task anyway. This does not guarantee success, but it avoids failing before the task has a chance to run.

If max_cpu_per_task and max_memory_per_task are not set on a partition, these settings have no effect and Sprocket submits the task's resource request as-is.
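
For example, a site that prefers best-effort execution could combine the partition limits above with clamping behavior; under this configuration, a task requesting 128 CPUs would be submitted with the partition maximum of 64 CPUs rather than being rejected:

toml
[run.backends.default]
default_slurm_partition.max_cpu_per_task = 64
default_slurm_partition.max_memory_per_task = "96 GB"

[run.task]
cpu_limit_behavior = "try_with_max"
memory_limit_behavior = "try_with_max"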

Running your first workflow

Create a file called hello.wdl:

wdl
version 1.3

task say_hello {
    input {
        String greeting
    }

    command <<<
        echo "~{greeting}, world!"
    >>>

    output {
        String message = read_string(stdout())
    }

    requirements {
        container: "ubuntu:latest"
    }
}

Run it:

shell
sprocket run hello.wdl --target say_hello greeting="Hello"

Sprocket will pull the ubuntu:latest container image (converting it to an Apptainer SIF file), submit a job via sbatch, wait for the job to complete, and collect the outputs. You can monitor the Slurm job while it runs:

shell
squeue -u $USER

Once the run completes, the output directory will contain:

out/
├── sprocket.db
└── runs/
    └── say_hello/
        └── <timestamp>/
            ├── output.log
            ├── inputs.json
            ├── outputs.json
            └── attempts/
                └── 0/
                    ├── command
                    ├── stdout
                    ├── stderr
                    └── work/

Production considerations

Output directory placement

Place the output directory on the shared filesystem so that both the login node and compute nodes can access it. Use the -o flag to specify the location:

shell
sprocket run workflow.wdl --target main -o /shared/results/my-project

Scatter concurrency

The run.workflow.scatter.concurrency setting controls how many elements within a scatter block are evaluated concurrently. The default is 1000:

toml
run.workflow.scatter.concurrency = 1000

Setting this too high can put pressure on the scheduler by queueing a large number of jobs at once. Note that each scattered task may itself request multiple CPUs, so the actual resource consumption can be a multiple of this number.
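
If your scheduler struggles with large bursts of queued jobs, lowering the value in the shared sprocket.toml limits how many scatter elements Sprocket evaluates at once; the number below is only illustrative:

toml
# Evaluate at most 250 scatter elements concurrently
run.workflow.scatter.concurrency = 250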

Container image caching

Sprocket pulls container images and converts them to SIF files in an apptainer-images/ directory inside each run's timestamped directory. A given image is only pulled once within a run, but each new run pulls all of its images fresh. For workflows that use many large images, you can pre-pull images to a shared location using apptainer pull and reference the local SIF path in your WDL container declarations:

shell
apptainer pull /shared/containers/ubuntu_latest.sif docker://ubuntu:latest
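
A task's requirements block can then point at the pre-pulled SIF file from the command above instead of a registry tag:

wdl
requirements {
    container: "/shared/containers/ubuntu_latest.sif"
}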

Provenance database

The sprocket.db SQLite database lives in the output directory and records every run, task, and session. SQLite works over shared filesystems, but locking on network filesystems can be unreliable under heavy concurrency, so back up the database regularly. See the provenance tracking documentation for details.
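
One simple approach is an online snapshot with the sqlite3 command-line tool, assuming it is available on the login node (the backup destination is a placeholder):

shell
# Copy the provenance database to a backup location
sqlite3 out/sprocket.db ".backup '/shared/backups/sprocket-$(date +%F).db'"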

Failure mode

The run.fail setting controls what happens when a task fails:

  • "slow" (default) — Sprocket waits for all currently running tasks to complete before exiting.
  • "fast" — Sprocket cancels all running tasks immediately via scancel and exits.
toml
[run]
fail = "fast"

Monitoring and troubleshooting

Checking job status

While a workflow is running, you can use squeue to see the Slurm jobs that Sprocket has submitted:

shell
squeue -u $USER
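
squeue only lists jobs that are still queued or running; once jobs have finished, sacct can report how they ended:

shell
# Jobs from the last day, with their final state and exit code
sacct -u $USER --starttime now-1days --format=JobID,JobName,State,Elapsed,ExitCode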

Inspecting run output

Each task attempt writes its files to out/runs/<target>/<timestamp>/attempts/<n>/:

File       Contents
command    The shell script that was executed inside the container
stdout     Standard output from the task
stderr     Standard error from the task
work/      The task's working directory, containing any output files

When troubleshooting a failed task, start with stderr and command to understand what ran and what went wrong. See the provenance tracking documentation for a full description of the run directory structure.
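
For the hello.wdl example above, that means inspecting files such as the following (the timestamp directory name differs for each run, hence the glob):

shell
cat out/runs/say_hello/*/attempts/0/command
cat out/runs/say_hello/*/attempts/0/stderr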

Common issues

  • Job pending indefinitely. The task's resource request (CPU or memory) may exceed what the partition can provide. Check with scontrol show job <jobid> and compare against your partition limits. Consider setting cpu_limit_behavior = "try_with_max" or adjusting the partition's max_cpu_per_task and max_memory_per_task.

  • Container pull failure. Compute nodes may lack network access to the container registry. Check whether the nodes can reach the registry, or pre-pull the image and reference a local path.

  • Permission errors. The shared filesystem must be writable by the user running Sprocket and by the Slurm jobs on the compute nodes. Verify that directory permissions are consistent across nodes.

  • Apptainer not found on compute nodes. Ensure Apptainer is installed and on the PATH for Slurm jobs. If Apptainer is provided via an environment module, it must be loaded in the user's environment before running sprocket run so that the job inherits the correct PATH.

Getting help

If you run into problems or have feedback, join the OpenWDL Slack and reach out in the #sprocket channel.