Job Submission in Slurm

Now that you understand the basic types of jobs, you are ready to submit a job.

When you submit a new job, the scheduler places your job in the queue until it can find available resources to assign.

Slurm Job Submission Commands

Slurm has three different job submission commands: salloc, sbatch, and srun. Understanding the differences between these commands will help you choose the one best suited to your task.

The salloc command

The salloc command is used to request resource allocation for interactive jobs.

salloc [JOB-SUBMISSION-FLAGS]

By default, without additional options, salloc will request one compute node, one CPU core, and one gigabyte of memory for thirty minutes. To customize your resource request, you can pass one or more job submission flags (such as --time or --mem) to specify what you need.

Once you submit your salloc job, it will follow this life cycle:

  • If the scheduler accepts your job, salloc will pause while your job waits in the queue for resources to be allocated.
  • When the job is allocated, salloc will provide an interactive shell session where you can type commands.
  • When your shell session exits, salloc will automatically mark your job as completed and release the resources for use by other jobs.

To learn more, you can read the salloc manual page.

Example usage of salloc

Let's say that we need the following for a job:

  • one node
  • four CPU cores
  • 8 GB of memory
  • 2 hours of wall time

The command to start this example job would be:

salloc --cpus-per-task 4 --mem 8GB --time 02:00:00

Example output of salloc:

salloc: Pending job allocation 2113
salloc: job 2113 queued and waiting for resources
salloc: job 2113 has been allocated resources
salloc: Granted job allocation 2113
salloc: Nodes node0405 are ready for job
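
When you are finished working, end the interactive session to release the allocation for other jobs:

exit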

The sbatch command

The sbatch command is used to submit batch jobs that will run in the background.

sbatch [JOB-SUBMISSION-FLAGS] [SCRIPT-FILE]

Users must pass the path to a shell script (typically a .sh file) that will be executed when your job starts. The script must specify both the resource requirements for the job and the commands to run.

Your shell script should use special #SBATCH comments at the top of the file to specify resource requirements. You can use any of the standard job submission flags. If desired, you may override flags specified in the script file by passing them to sbatch before the script file name.
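
For example, to reuse a script's settings but override its wall time with a hypothetical four-hour request:

sbatch --time 04:00:00 myjob.sh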

Below your #SBATCH comments, you should specify one or more commands that you would like to run during your task. We strongly recommend using srun to run commands in your batch script to make debugging easier.

Once you submit your sbatch job, it will follow this life cycle:

  • If the scheduler accepts your job, sbatch will output the Job ID for your newly created job.
  • After submission, you may disconnect from Palmetto and work on other things.
  • When the job is allocated, the shell script that you provided to sbatch will be executed.
  • When your shell script exits, the job will be marked as completed and resources will be released for use by other jobs.

Job output is saved to a file on disk by default. See the control page for instructions on how to view job output during or after execution.

To learn more, you can read the sbatch manual page.

Example usage of sbatch

First, we will need to create a shell script for our job. These can be created with any text editor (like vim or nano) and must be stored on the cluster.

For this example, we will create myjob.sh in our home directory with the following contents:

#!/bin/bash

#SBATCH --job-name my-job-name
#SBATCH --nodes 2
#SBATCH --tasks-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --mem 12gb
#SBATCH --time 02:00:00
#SBATCH --gpus-per-node a100:1

srun python3 run-my-science-workflow.py

Notice the special #SBATCH comment lines, which will determine what resources our job requires. To learn more about each of these flags, please review the job submission flags section.

To submit this job, you would run the following command:

sbatch myjob.sh

Example output of sbatch:

sbatch: Submit checks complete!
Submitted batch job 2349

The srun command

The srun command is typically used to launch programs as a parallel job within a batch job submitted by sbatch.

srun [JOB-SUBMISSION-FLAGS] [COMMAND-TO-RUN] [FLAGS-FOR-COMMAND-TO-RUN]
note

srun may also be used outside of sbatch scripts to request other parallel job allocations, which is not covered on this page.

Using srun provides better integration with the Slurm scheduler and can help with debugging individual steps of a job. This is especially helpful when you are running more than one task (--ntasks greater than 1), possibly even across multiple nodes (--nodes greater than 1).

When used inside of an sbatch job script, commands passed to srun are run on all nodes/tasks in the job by default. You can alter this behavior using the --ntasks and --nodes flags.
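
As a minimal sketch of this default behavior: in a batch script that requested --ntasks 4, a bare srun launches its command once per task, so the script below would print four hostnames.

#!/bin/bash

#SBATCH --ntasks 4

srun hostname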

If you would like to further subdivide the resources requested by your job, you can specify additional job submission flags.

To learn more, you can read the srun manual page.

Basic example usage of srun within an sbatch job with multiple nodes

Let's say we have submitted a batch job using sbatch myjob.sh and the myjob.sh file has the following contents:

#!/bin/bash

#SBATCH --ntasks 3
#SBATCH --nodes 3

srun --nodes 1 python3 server.py
srun --nodes 2 python3 worker.py

This batch job requests 3 nodes and then uses srun to execute a server program on one node and a worker program on the other two.

Slurm Job Submission Flags

When submitting a job, you will need to use a combination of flags to specify which resources your job needs.

You can specify just the flags you need. For example, if you are not using a GPU, you do not need to specify --gpus-per-task 0.

Recommended Flags

At a minimum, we recommend that you specify wall time, CPU count, and memory.

Wall Time

Wall time is the maximum amount of time your job should take to complete. This amount is measured in real time (like a wall clock), which is different from CPU time.

The timer begins when your job starts executing, so the time your job spends waiting in the queue is not included in your wall time limit. Once the timer expires, your job will be terminated if it is still running.

Use flag --time (or lowercase -t for short), and format your request as hh:mm:ss.

For example, to ask for 80 hours, 39 minutes, and 20 seconds:

--time 80:39:20

Wall Time Limits

To help us better allocate the limited pool of shared resources, there is a maximum amount of wall time that users may request for a job. The limit varies based on job source and the destination partition you submit to.

Job Source                                 General Partitions    Owner Partitions
Batch Job Submission (via sbatch)          72 hours (3 days)     336 hours (14 days)
Interactive Job Submission (via salloc)    12 hours              336 hours (14 days)
OnDemand Interactive App Session           12 hours              12 hours

CPU Core Count

By default, jobs will only receive one CPU core per task.

Use flag --cpus-per-task (or lowercase -c for short).

For example, to ask for 7 CPU cores:

--cpus-per-task 7

Memory

There are many ways to specify the amount of memory your job requires in Slurm; the best approach depends on how many tasks your job requests.

The recommendation below applies only to single-task jobs (where ntasks is not specified or ntasks=1).

Use flag --mem and provide the amount of memory you need.

For example, to ask for 2.5 gigabytes:

--mem 2.5gb
caution

The default unit is megabytes (mb), so be sure to specify gb if you intend to ask for gigabytes of memory.
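
For example, these two requests are not equivalent:

--mem 2500
--mem 2.5gb

The first requests 2500 megabytes (about 2.4 GB); only the second requests 2.5 gigabytes.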

Job Name

Users can give their jobs a unique name to make it easier to find them in reports.

Use flag --job-name (or uppercase -J for short) and provide the desired name. Note that these names should not include spaces.

For example, to use name dice_roll_simulation:

--job-name dice_roll_simulation

GPU Count

By default, no GPU resources are assigned to jobs. Users can ask for GPU resources by specifying the number of GPUs they need and (optionally) the desired model of GPU.

Use flag --gpus (or uppercase -G for short), and pass either:

  • the number of GPUs you want, of any type
  • the type of GPU you want and the number of GPUs you want, separated by a colon (:)

For example, if you want two GPUs of any type:

--gpus 2

Or, if you want a single NVIDIA A100 GPU:

--gpus a100:1

What GPU models are available?

Palmetto has a wide variety of GPUs available.

Here is a list of currently available GPU types and their codes:

  • NVIDIA V100: v100
  • NVIDIA A100: a100
  • NVIDIA H100: h100
  • NVIDIA L40S: l40s

Users are encouraged to review the hardware table to see what GPU models are available on which nodes.

What if I need a specific VRAM variant of a particular GPU model?

While several nodes in the cluster may have the same model of GPU, they may have different amounts of VRAM.

For example, we have some NVIDIA A100s with 80 GB of VRAM and some NVIDIA A100s with 40 GB of VRAM.

See the GPU VRAM Variant feature constraint for more information.

What if I am running multiple tasks?

If you are running multiple tasks (where ntasks > 1), then you may want more control over how your allocated GPU resources are assigned.

You can use --gpus-per-task to instruct the scheduler how many GPUs to assign to each task. The argument format is the same as for --gpus.

Important: you must still specify --gpus to tell the scheduler how many GPUs you want total for the job.

For example, to run two tasks with two NVIDIA A100 GPUs in total, one GPU per task:

--ntasks 2 --gpus a100:2 --gpus-per-task a100:1

Partition Target

Users can specify which partition they would like to submit their job to. Partitions are analogous to queues.

If you have access to purchased nodes on the cluster, submitting to the associated owner partition will grant your job priority over general partition jobs.

Use flag --partition (or lowercase -p for short) and pass the name of the partition.
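
For example, to target the general partition work1 (see below):

--partition work1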

How can I find out what partitions I have access to?

You can run the command below to see a list of partitions you have access to:

sinfo --Format PartitionName

Unless you have access to an owner's priority partition, you will likely only see work1.

Output File Location
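
By default, Slurm writes batch job output to a file named slurm-<job id>.out in the directory you submitted the job from. Use flag --output (or lowercase -o for short) and pass the desired file path to change this.

For example, to write output to a file whose name includes the job ID (Slurm expands %j to the job ID):

--output my-job-%j.out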

Feature Constraints

Advanced users of Palmetto may need to request a very specific type of hardware. Feature constraints allow you to narrow down what resources your job is eligible to run on.

Use flag --constraint or uppercase -C for short, then pass a comma-delimited list of features that you require.

For example, say you need a node with interconnect type FDR and an Intel CPU:

--constraint interconnect_fdr,chip_manufacturer_intel
caution

While many options in Slurm are common across compute clusters, there is no standard convention for feature values. If you are copying code from elsewhere that uses feature constraints, make sure you are using the correct feature values from below for Palmetto.

Interconnects

To specify which interconnect you want, the following features are available:

Interconnect Type            Feature Value
1 Gigabit Ethernet           interconnect_1ge
10 Gigabit Ethernet          interconnect_10ge
25 Gigabit Ethernet          interconnect_25ge
100 Gigabit Ethernet         interconnect_100ge
HDR InfiniBand (100 Gbps)    interconnect_hdr or interconnect_100g
FDR InfiniBand (56 Gbps)     interconnect_fdr or interconnect_56g

CPU Manufacturer

To specify which CPU manufacturer you want, the following features are available:

CPU Manufacturer    Feature Value
Intel               chip_manufacturer_intel
AMD                 chip_manufacturer_amd

CPU Generation

CPU Generation           Feature Value
AMD Genoa                cpu_gen_genoa
AMD Milan                cpu_gen_milan
AMD Rome                 cpu_gen_rome
Intel Broadwell          cpu_gen_broadwell
Intel Cascade Lake       cpu_gen_cascadelake
Intel Haswell            cpu_gen_haswell
Intel Ice Lake           cpu_gen_icelake
Intel Sapphire Rapids    cpu_gen_sapphirerapids
Intel Sky Lake           cpu_gen_skylake

GPU VRAM Variant

To specify which VRAM variant of a particular GPU you want, the following features are available:

GPU Model      VRAM Amount    Feature Value
NVIDIA A100    40 GB          gpu_a100_40gb
NVIDIA A100    80 GB          gpu_a100_80gb
info

Specifying the feature flag does not request a GPU. For example, if you specify --constraint gpu_a100_80gb, but do not also specify --gpus a100:1, then you would not receive a GPU.

The feature constraint simply filters out which VRAM variant of a particular GPU your job is eligible to run on.
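
For example, to request one NVIDIA A100 and restrict your job to the 80 GB variant, combine both flags:

--gpus a100:1 --constraint gpu_a100_80gb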

GPU Interconnects

GPU Interconnect Type    Feature Value
NVLink                   gpu_interconnect_nvlink
PCI Express (PCIe)       gpu_interconnect_pcie
info

If you do not specify a constraint, the default is PCI Express (PCIe).
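
For example, to request two GPUs connected by NVLink:

--gpus 2 --constraint gpu_interconnect_nvlink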

Task and Node Count

If you are an advanced user, you may need access to multiple chunks of compute resources to complete your job. In particular, this is useful for MPI.

Use flag --ntasks or lowercase -n for short. Pass the number of tasks you want to run as an argument.

For example, if I want to run four tasks (MPI processes):

--ntasks 4
info

Multiple tasks can land on the same or different compute nodes depending on the available resources.

To specify how many nodes you want, you can use flag --nodes along with --ntasks.

For example, to run four tasks across two different compute nodes:

--nodes 2 --ntasks 4
info

It is possible that the four tasks (MPI processes) will not be allocated evenly across the two nodes.

To allocate the four tasks (MPI processes) evenly on the two nodes, you can use flag --nodes together with --tasks-per-node.

For example, to run four tasks across two different compute nodes, with two tasks on each node:

--nodes 2 --tasks-per-node 2

For the same job, with the four tasks (MPI processes) allocated evenly across the two nodes, you can additionally request six threads for each task using flag --cpus-per-task.

--nodes 2 --tasks-per-node 2 --cpus-per-task 6

This usually works together with the OMP_NUM_THREADS environment variable, which should be set to the same number as --cpus-per-task:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

$SLURM_CPUS_PER_TASK is a Slurm environment variable; more Slurm environment variables are listed in a later section.
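
Putting these flags together, a minimal sketch of a hybrid MPI/OpenMP batch script might look like the following (the resource values and the application name my_hybrid_app are illustrative placeholders):

#!/bin/bash

#SBATCH --nodes 2
#SBATCH --tasks-per-node 2
#SBATCH --cpus-per-task 6
#SBATCH --mem 12gb
#SBATCH --time 01:00:00

# Match the OpenMP thread count to the CPU cores allocated per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch one MPI process per task (four total, two per node)
srun ./my_hybrid_app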

OpenMPI

On the Slurm cluster, the prebuilt OpenMPI modules (versions 4.1.5 and 4.1.6) were compiled with PMI2 support enabled. Therefore, if your application is compiled against the prebuilt OpenMPI, it can be launched directly with the srun command (as in the sketch above). Resources such as the number of nodes, the number of MPI processes, the number of threads per MPI process, and memory are determined automatically from your --nodes, --ntasks-per-node, --cpus-per-task, --mem, and related flags.

Intel MPI

On the Slurm cluster, it is also recommended to use the srun command if the application is built against Intel MPI. This method is the best integrated with Slurm and supports process tracking, accounting, task affinity, suspend/resume, and other features. However, Intel MPI does not link directly to any external PMI implementation, so users must manually point it at the PMI2 library. This can be done with the following command:

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

Then the application can be launched with the srun command. As with OpenMPI, resources such as the number of nodes, MPI processes, threads per process, and memory are determined automatically from your job submission flags.

srun ./your_app_name
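
Putting it together, a minimal batch script for an Intel MPI application might look like the sketch below (resource values and the application name are placeholders):

#!/bin/bash

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 4
#SBATCH --time 01:00:00

# Point Intel MPI at the system PMI2 library so srun can manage the MPI processes
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

srun ./your_app_name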

Available Environment Variables in a Slurm Job

The table below shows common Slurm environment variables. For a full list of Slurm environment variables, please visit the official documentation.

Variable Name            Description
$SLURM_JOB_ID            Job ID
$SLURM_JOB_NAME          Job name
$SLURM_SUBMIT_DIR        Submission directory
$SLURM_JOB_NODELIST      Nodes allocated to the job (also shown by srun hostname)
$SLURM_NTASKS            Total number of tasks (MPI processes); note: not the total number of cores if --cpus-per-task is not 1
$SLURM_CPUS_PER_TASK     Number of CPU cores for each task (MPI process)
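
As a quick illustration, the hypothetical batch script below prints several of these variables when the job runs:

#!/bin/bash

#SBATCH --job-name env-demo
#SBATCH --ntasks 2
#SBATCH --cpus-per-task 2

echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) was submitted from $SLURM_SUBMIT_DIR"
echo "Running $SLURM_NTASKS tasks with $SLURM_CPUS_PER_TASK cores each"
srun hostname   # lists the allocated nodes, once per task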