Job Submission in Slurm
Now that you understand the basic types of jobs, you are ready to submit a job.
When you submit a new job, the scheduler places your job in the queue until it can find available resources to assign.
Slurm Job Submission Commands
Slurm has three different job submission commands: salloc, sbatch, and srun. Users will need to understand the difference between these commands to determine which command is best suited to the task they are trying to accomplish.
The salloc command
The salloc command is used to request resource allocation for interactive jobs.
salloc [JOB-SUBMISSION-FLAGS]
By default, without additional options, salloc will request one compute node, one CPU core, and one gigabyte of memory for thirty minutes. To customize your resource request, you can pass one or more job submission flags (such as --time or --mem) to specify what you need.
Once you submit your salloc job, it will follow this life cycle:
- If the scheduler accepts your job, salloc will pause while your job waits in the queue for resources to be allocated.
- When the job is allocated, salloc will provide an interactive shell session where you can type commands.
- When your shell session exits, salloc will automatically mark your job as completed and release the resources for use by other jobs.
To learn more, you can read the salloc manual page.
Example usage of salloc
Let's say that we need the following for a job:
- one node
- four CPU cores
- 8 GB of memory
- 2 hours of wall time
The command to start this example job would be:
salloc --cpus-per-task 4 --mem 8GB --time 02:00:00
Example output of salloc:
salloc: Pending job allocation 2113
salloc: job 2113 queued and waiting for resources
salloc: job 2113 has been allocated resources
salloc: Granted job allocation 2113
salloc: Nodes node0405 are ready for job
The sbatch command
The sbatch command is used to submit batch jobs that will run in the background.
sbatch [JOB-SUBMISSION-FLAGS] [SCRIPT-FILE]
You must pass the path to a shell script (typically a .sh file) that sbatch will invoke when your job starts. The script must specify both the resource requirements for the job and the commands to run.
Your shell script should use special #SBATCH comments at the top of the file to specify resource requirements. You can use any of the standard job submission flags. If desired, you may override flags specified in the script file by passing them to sbatch before the script file name.
Below your #SBATCH comments, you should specify one or more commands that you would like to run during your task. We strongly recommend using srun to run commands in your batch script to make debugging easier.
Once you submit your sbatch job, it will follow this life cycle:
- If the scheduler accepts your job, sbatch will output the Job ID for your newly created job.
- After submission, you may disconnect from Palmetto and work on other things.
- When the job is allocated, the shell script that you provided to sbatch will be executed.
- When your shell script exits, the job will be marked as completed and resources will be released for use by other jobs.
Job output is saved to a file on disk by default. See the control page for instructions on how to view job output during or after execution.
To learn more, you can read the sbatch manual page.
Example usage of sbatch
First, we will need to create a shell script for our job. These can be created with any text editor (like vim or nano) and must be stored on the cluster.
For this example, we will create myjob.sh in our home directory with the following contents:
#!/bin/bash
#SBATCH --job-name my-job-name
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --mem 12gb
#SBATCH --time 02:00:00
#SBATCH --gpus-per-node a100:1
srun python3 run-my-science-workflow.py
Notice the special #SBATCH comment lines, which will determine what resources our job requires. To learn more about each of these flags, please review the job submission flags section.
To submit this job, you would run the following command:
sbatch myjob.sh
Example output of sbatch:
sbatch: Submit checks complete!
Submitted batch job 2349
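When submitting from a script, it can be useful to capture the Job ID from this output so it can be reused later. The sketch below assumes the output format shown above; extract_job_id is a hypothetical helper name, not a Slurm command, and the demonstration runs on the sample text rather than a real submission.

```shell
#!/bin/bash
# extract_job_id is a hypothetical helper (not part of Slurm).
# It reads sbatch output on stdin and prints the numeric Job ID from the
# "Submitted batch job <ID>" line, ignoring any other lines.
extract_job_id() {
    awk '/^Submitted batch job [0-9]+/ { print $4 }'
}

# Demonstration using the sample output shown above.
sample_output='sbatch: Submit checks complete!
Submitted batch job 2349'
job_id=$(printf '%s\n' "$sample_output" | extract_job_id)
echo "Job ID: $job_id"

# On the cluster, the same helper could be used as:
#   job_id=$(sbatch myjob.sh | extract_job_id)
```

Slurm also offers sbatch --parsable, which prints only the Job ID (and the cluster name, if present), and may be simpler when it suits your workflow.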
The srun command
The srun command is typically used to launch programs as a parallel job within a batch job submitted by sbatch.
srun [JOB-SUBMISSION-FLAGS] [COMMAND-TO-RUN] [FLAGS-FOR-COMMAND-TO-RUN]
srun may also be used outside of sbatch scripts to request other parallel job allocations, which is not covered on this page.
Using srun provides better integration with the Slurm scheduler and can help with debugging individual steps of a job. This is especially helpful when you are running more than one task (--ntasks greater than 1), possibly even across multiple nodes (--nodes greater than 1).
When used inside of an sbatch job script, commands passed to srun are run on all nodes/tasks in the job by default. You can alter this behavior using the --ntasks and --nodes flags.
If you would like to further subdivide the resources requested by your job, you can specify additional job submission flags.
To learn more, you can read the srun manual page.
Basic example usage of srun within an sbatch job with multiple nodes
Let's say we have submitted a batch job using sbatch myjob.sh and the myjob.sh file has the following contents:
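#!/bin/bash
#SBATCH --ntasks 3
#SBATCH --nodes 3
# Launch each job step in the background so the server and workers run concurrently.
srun --nodes 1 --ntasks 1 python3 server.py &
srun --nodes 2 --ntasks 2 python3 worker.py &
# Block until both job steps have finished.
wait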
This batch job requests 3 nodes and then uses srun to execute a server program on one node and a worker program on the other two.
Slurm Job Submission Flags
When submitting a job, you will need to use a combination of flags to specify which resources your job needs.
You can specify just the flags you need. For example, if you are not using a GPU, you do not need to specify --gpus-per-task 0.
Wall Time
Wall time is the maximum amount of time your job should take to complete. This amount is measured in real time (like a wall clock), which is different from CPU time.
The timer begins when your job starts executing, so the time your job spends waiting in the queue is not included in your wall time limit. Once the timer expires, your job will be terminated if it is still running.
Use flag --time (or lowercase -t for short), and format your request as hh:mm:ss.
For example, to ask for 80 hours, 39 minutes, and 20 seconds:
--time 80:39:20
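If your estimate is in seconds, for instance from timing an earlier run, the conversion to hh:mm:ss can be scripted. This is a minimal sketch; format_walltime is a hypothetical helper name, not a Slurm tool.

```shell
#!/bin/bash
# format_walltime is a hypothetical helper (not part of Slurm).
# It converts a whole number of seconds ($1) into the hh:mm:ss form used by --time.
format_walltime() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 )) $(( $1 % 60 ))
}

# 80 hours, 39 minutes, and 20 seconds expressed in seconds.
requested=$(( 80 * 3600 + 39 * 60 + 20 ))
echo "Wall time flag: --time $(format_walltime "$requested")"
```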
To help us better allocate the limited pool of shared resources, there is a maximum amount of wall time that users may request for a job. The limit varies based on job source and the destination partition you submit to.
Job Source | General Partitions | Owner Partitions |
---|---|---|
Batch Job Submission (via sbatch) | 72 hours (3 days) | 336 hours (14 days) |
Interactive Job Submission (via salloc) | 12 hours | 336 hours (14 days) |
OnDemand Interactive App Session | 12 hours | 336 hours (14 days) |
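Before submitting, you may want to check a planned request against the limits in the table above. The sketch below does this comparison locally. Both helper names are hypothetical, and the 72-hour figure is simply the general partition batch limit from the table.

```shell
#!/bin/bash
# walltime_to_seconds and within_limit are hypothetical helpers (not Slurm tools).

# Convert an hh:mm:ss string ($1) into a number of seconds.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed when the requested wall time ($1, hh:mm:ss) fits within a limit in hours ($2).
within_limit() {
    [ "$(walltime_to_seconds "$1")" -le $(( $2 * 3600 )) ]
}

# Batch jobs on general partitions are limited to 72 hours in the table above.
for request in 02:00:00 72:00:00 80:39:20; do
    if within_limit "$request" 72; then
        echo "$request fits within 72 hours"
    else
        echo "$request exceeds 72 hours"
    fi
done
```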
CPU Core Count
By default, jobs will only receive one CPU core per task.
Use flag --cpus-per-task (or lowercase -c for short).
For example, to ask for 7 CPU cores:
--cpus-per-task 7
Memory
There are many ways to specify the amount of memory your job requires in Slurm.
The best route will depend on how many tasks your job requests.
Single-Task Jobs
This recommendation applies only to single-task jobs (where ntasks is not specified or ntasks=1).
Use flag --mem and provide the amount of memory you need.
For example, to ask for 2.5 gigabytes:
--mem 2.5gb
Multi-Task Jobs
For advanced users with multi-task jobs (where ntasks > 1), you will need to be careful about how you specify your memory request.
Often, the goal is to get an amount of memory per task, rather than in total or per assigned node. Unfortunately, Slurm does not have a --mem-per-task flag, so this is not easy to do.
Instead, we recommend taking advantage of the --mem-per-cpu flag.
For example, let's say that you are running 4 tasks, need 24 gigabytes of memory per task, and request 6 CPUs per task. We can divide the amount of memory by the number of CPUs, to get the following:
--ntasks 4 --cpus-per-task 6 --mem-per-cpu 4gb
The reason this is necessary is that multiple tasks might land on the same node. If you used --mem 24gb instead, and 2 tasks land on the same node, those 2 tasks would share 24 gigabytes instead of getting 24 gigabytes each.
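The per-CPU division from the example above can be scripted as follows. This is only a sketch: mem_per_cpu_gb is a hypothetical helper name, and it rounds up to whole gigabytes so that each task receives at least the memory it needs when the numbers do not divide evenly.

```shell
#!/bin/bash
# mem_per_cpu_gb is a hypothetical helper (not part of Slurm).
# $1 = memory needed per task in gigabytes, $2 = CPUs per task.
# Integer division is rounded up so each task gets at least the memory it needs.
mem_per_cpu_gb() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

ntasks=4
cpus_per_task=6
mem_per_task_gb=24

per_cpu=$(mem_per_cpu_gb "$mem_per_task_gb" "$cpus_per_task")
echo "Flags: --ntasks $ntasks --cpus-per-task $cpus_per_task --mem-per-cpu ${per_cpu}gb"
```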
The default unit is megabytes (mb), so be sure to specify gb if you intend to ask for gigabytes of memory.
Job Name
Users can give their jobs a unique name to make it easier to find them in reports.
Use flag --job-name (or uppercase -J for short) and provide the desired name. Note that these names should not include spaces.
For example, to use name dice_roll_simulation:
--job-name dice_roll_simulation
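If job names are generated from free-form text, the spaces can be replaced before the name is passed to --job-name. A minimal sketch, where sanitize_job_name is a hypothetical helper and not a Slurm feature:

```shell
#!/bin/bash
# sanitize_job_name is a hypothetical helper (not part of Slurm).
# It replaces each run of whitespace in $1 with a single underscore.
sanitize_job_name() {
    printf '%s\n' "$1" | sed 's/[[:space:]][[:space:]]*/_/g'
}

echo "Flag: --job-name $(sanitize_job_name "dice roll simulation")"
```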
Email Notifications
Users can request email notifications on events. The flag to enable this is:
--mail-type <type>
Type can be one or more of the following:
- NONE: no emails (default)
- BEGIN: job began
- END: job ended
- FAIL: job failed
- REQUEUE: job requeued
- ALL: the same as BEGIN,END,FAIL,INVALID_DEPEND,REQUEUE,STAGE_OUT
- INVALID_DEPEND: dependency never satisfied
- TIME_LIMIT: reached time limit
- TIME_LIMIT_90: reached 90% of time limit
- TIME_LIMIT_80: reached 80% of time limit
- TIME_LIMIT_50: reached 50% of time limit
- ARRAY_TASKS: send email for each array task
GPU Count
By default, no GPU resources are assigned to jobs. Users can ask for GPU resources by specifying the number of GPUs they need and (optionally) the desired model of GPU.
Use flag --gpus or lowercase -G for short, and pass either:
- the number of GPUs you want, of any type
- the type of GPU you want and the number of GPUs you want, separated by a colon (:)
For example, if you want two GPUs of any type:
--gpus 2
Or, if you want a single NVIDIA A100 GPU:
--gpus a100:1
What GPU models are available?
Palmetto has a wide variety of GPUs available.
Here is a list of currently available GPU types and their codes:
- NVIDIA H100: h100
- NVIDIA L40S: l40s
- NVIDIA A40: a40
- NVIDIA A100: a100
- NVIDIA V100S: v100s
- NVIDIA V100: v100
- NVIDIA P100: p100
- NVIDIA K40: k40
- NVIDIA K20: k20
GPUs are sorted from newest to oldest. Users are encouraged to review the hardware table to see what GPU models are available on each node.
What if I need a specific VRAM variant of a particular GPU model?
While several nodes in the cluster may have the same model of GPU, they may have different amounts of VRAM.
For example, we have some NVIDIA A100s with 80 GB of VRAM and some NVIDIA A100s with 40 GB of VRAM.
See the GPU VRAM Variant feature constraint for more information.
What if I am running multiple tasks?
If you are running multiple tasks (where ntasks > 1), then you may want more control over how your allocated GPU resources are assigned.
You can use --gpus-per-task to instruct the scheduler how many GPUs to assign to each task. The argument format is the same as --gpus.
Important: you must still specify --gpus to tell the scheduler how many GPUs you want in total for the job.
For example, if I am running two tasks, want two NVIDIA A100 GPUs, and one GPU for each task:
--ntasks 2 --gpus a100:2 --gpus-per-task a100:1
Partition Target
Users can specify which partition they would like to submit their job to. Partitions are analogous to queues.
If you have access to purchased nodes on the cluster, submitting to the associated owner partition will grant your job priority over general partition jobs.
Use flag --partition (or lowercase -p for short) and pass the name of the partition.
How can I find out what partitions I have access to?
You can run the command below to see a list of partitions you have access to:
sinfo --Format PartitionName
Unless you have access to an owner's priority partition, you will likely only see work1.
Output File Location
Feature Constraints
Advanced users of Palmetto may need to request a very specific type of hardware. Feature constraints allow you to narrow down what resources your job is eligible to run on.
Use flag --constraint or uppercase -C for short, then pass a comma-delimited list of features that you require.
For example, say you need a node with interconnect type FDR and an Intel CPU:
--constraint interconnect_fdr,chip_manufacturer_intel
While many options in Slurm are common across compute clusters, there is no standard convention for feature values. If you are copying code from elsewhere that uses feature constraints, make sure you are using the correct feature values from below for Palmetto.
Interconnects
To specify which interconnect you want, the following features are available:
Interconnect Type | Feature Value |
---|---|
1 Gigabit Ethernet | interconnect_1ge |
10 Gigabit Ethernet | interconnect_10ge |
25 Gigabit Ethernet | interconnect_25ge |
100 Gigabit Ethernet | interconnect_100ge |
HDR InfiniBand (100 Gbps) | interconnect_hdr or interconnect_100g |
FDR InfiniBand (56 Gbps) | interconnect_fdr or interconnect_56g |
CPU Manufacturer
To specify which CPU manufacturer you want, the following features are available:
CPU Manufacturer | Feature Value |
---|---|
Intel | chip_manufacturer_intel |
AMD | chip_manufacturer_amd |
CPU Generation
CPU Generation | Feature Value |
---|---|
AMD Genoa | cpu_gen_genoa |
AMD Milan | cpu_gen_milan |
AMD Rome | cpu_gen_rome |
Intel Broadwell | cpu_gen_broadwell |
Intel Cascade Lake | cpu_gen_cascadelake |
Intel Haswell | cpu_gen_haswell |
Intel Ice Lake | cpu_gen_icelake |
Intel Sapphire Rapids | cpu_gen_sapphirerapids |
Intel Skylake | cpu_gen_skylake |
CPU Chip Model
If you need a specific model of CPU, then you can specify a chip_type feature.
CPU Manufacturer | CPU Generation | CPU Model | Core Count | Feature Value |
---|---|---|---|---|
AMD | Genoa | EPYC 9654 | 96 | chip_type_9654 |
AMD | Milan | EPYC 7543 | 32 | chip_type_7543 |
AMD | Milan | EPYC 7713P | 64 | chip_type_7713p |
AMD | Rome | EPYC 7742 | 64 | chip_type_7742 |
Intel | Broadwell | Xeon E5-2640 v4 | 10 | chip_type_e5-2640v4 |
Intel | Broadwell | Xeon E5-2680 v4 | 8 | chip_type_e5-2680v4 |
Intel | Broadwell | Xeon E5-4627 v4 | 10 | chip_type_e5-4627v4 |
Intel | Cascade Lake | Xeon Gold 6230R | 26 | chip_type_6230r |
Intel | Cascade Lake | Xeon Gold 6238R | 28 | chip_type_6238r |
Intel | Cascade Lake | Xeon Gold 6240 | 18 | chip_type_6240 |
Intel | Cascade Lake | Xeon Gold 6248 | 20 | chip_type_6248g |
Intel | Cascade Lake | Xeon Gold 6252 | 24 | chip_type_6252g |
Intel | Cascade Lake | Xeon Gold 6258R | 28 | chip_type_6258r |
Intel | Haswell | Xeon E5-2680 v3 | 12 | chip_type_e5-2680v3 |
Intel | Haswell | Xeon E5-2698 | 20 | chip_type_e5-2698 |
Intel | Ice Lake | Xeon Gold 6348 | 28 | chip_type_6348g |
Intel | Ice Lake | Xeon Platinum 8358 | 32 | chip_type_8358 |
Intel | Ice Lake | Xeon Platinum 8360Y | 36 | chip_type_8360y |
Intel | Ice Lake | Xeon Platinum 8362 | 32 | chip_type_8362 |
Intel | Ivy Bridge | Xeon E5-2650 v2 | 8 | chip_type_e5-2650v2 |
Intel | Ivy Bridge | Xeon E5-2670 v2 | 10 | chip_type_e5-2670v2 |
Intel | Ivy Bridge | Xeon E5-3670 v2 | 10 | chip_type_e5-3670v2 |
Intel | Sandy Bridge | Xeon E5-4640 | 8 | chip_type_e5-4640 |
Intel | Sandy Bridge | Xeon E5-2665 | 8 | chip_type_e5-2665 |
Intel | Sapphire Rapids | Xeon Platinum 8462Y+ | 32 | chip_type_8462Y+ |
Intel | Sapphire Rapids | Xeon Platinum 8470 | 52 | chip_type_8470 |
Intel | Sapphire Rapids | Xeon Platinum 8480CL | 56 | chip_type_8480cl |
Intel | Skylake | Xeon Gold 6148 | 20 | chip_type_6148g |
X11/Graphical Software Capability
To specify that your job requires a node capable of running software with a graphical user interface (GUI), you can use the graphics feature.
Advanced Vector Extensions
To specify that your software requires a CPU capable of executing Advanced Vector Extensions (AVX) instructions, use the following features:
Capability | Feature Value |
---|---|
AVX | extension_avx |
AVX2 (256-bit) | extension_avx2 |
AVX-512 (512-bit) | extension_avx512 |
GPU VRAM Variant
To specify which VRAM variant of a particular GPU you want, the following features are available:
GPU Model | VRAM Amount | Feature Value |
---|---|---|
NVIDIA A100 | 40 GB | gpu_a100_40gb |
NVIDIA A100 | 80 GB | gpu_a100_80gb |
Specifying the feature flag does not request a GPU. For example, if you specify --constraint gpu_a100_80gb but do not also specify --gpus a100:1, then you would not receive a GPU.
The feature constraint simply filters out which VRAM variant of a particular GPU your job is eligible to run on.
GPU Interconnects
GPU Interconnect Type | Feature Value |
---|---|
NVLink | gpu_interconnect_nvlink |
PCI Express (PCIe) | gpu_interconnect_pcie |
If you do not specify a constraint, the default is PCI Express (PCIe).
Task and Node Count
Advanced users may need access to multiple chunks of compute resources to complete a job. In particular, this is useful for MPI.
Use flag --ntasks or lowercase -n for short. Pass the number of tasks you want to run as an argument.
For example, if I want to run four tasks (MPI processes):
--ntasks 4
Multiple tasks can land on the same or different compute nodes depending on the available resources.
To specify how many nodes you want, you can use flag --nodes along with --ntasks.
For example, to run four tasks across two different compute nodes:
--nodes 2 --ntasks 4
It is possible the four tasks (MPI processes) are not allocated evenly on the two nodes.
To allocate the four tasks (MPI processes) evenly on the two nodes, you can use flag --nodes together with --ntasks-per-node.
For example, to run four tasks across two different compute nodes, with two tasks on each node:
--nodes 2 --ntasks-per-node 2
For the same job with the four tasks (MPI processes) allocated evenly on the two nodes, if you want each task to run 6 threads, you can use flag --cpus-per-task to allocate 6 CPU cores per task:
--nodes 2 --ntasks-per-node 2 --cpus-per-task 6
This usually works together with the OMP_NUM_THREADS environment variable, which should be set to the same value as --cpus-per-task:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
$SLURM_CPUS_PER_TASK is a Slurm environment variable. More Slurm environment variables can be found in the next section.
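Slurm only sets SLURM_CPUS_PER_TASK when --cpus-per-task is specified, so a script that may also run without that flag (or outside a job) can benefit from a fallback. The sketch below assumes a default of 1 thread is acceptable; omp_threads is a hypothetical helper name.

```shell
#!/bin/bash
# omp_threads is a hypothetical helper (not part of Slurm).
# It prints the value of SLURM_CPUS_PER_TASK, or 1 when the variable is unset or empty.
omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

OMP_NUM_THREADS=$(omp_threads)
export OMP_NUM_THREADS
echo "Using $OMP_NUM_THREADS OpenMP thread(s) per task"
```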
OpenMPI
On the Slurm cluster, the prebuilt OpenMPI installations (both version 4.1.5 and 4.1.6) were compiled with pmi2 support enabled. Therefore, if your application is compiled using the prebuilt OpenMPI, it can be launched directly using the srun command. The resources (such as the number of nodes, number of MPI processes, number of threads for each MPI process, and memory) will be determined automatically from the values you pass for --nodes, --ntasks-per-node, --cpus-per-task, --mem, and so on.
Intel MPI
On the Slurm cluster, we also recommend using the srun command if the application is built against Intel MPI. This method is the best integrated with Slurm and supports process tracking, accounting, task affinity, suspend/resume, and other features. However, Intel MPI does not link directly to any external PMI implementation, so you need to point it to the PMI2 library manually. This can be done with the command:
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
The resources (such as the number of nodes, number of MPI processes, number of threads for each MPI process, and memory) will be determined automatically from the values you pass for --nodes, --ntasks-per-node, --cpus-per-task, --mem, and so on. You can then launch the application with the srun command:
srun ./your_app_name
Available Environment Variables in a Slurm Job
The table below shows common Slurm environment variables. For a full list of Slurm environment variables, please visit the official documentation.
Variable Name | Description |
---|---|
$SLURM_JOB_ID | Job id |
$SLURM_JOB_NAME | Job name |
$SLURM_SUBMIT_DIR | Submitting directory |
$SLURM_JOB_NODELIST or srun hostname | Nodes allocated to the job |
$SLURM_NTASKS | Total number of tasks or MPI processes (NOTE: not total number of cores if --cpus-per-task is not 1) |
$SLURM_CPUS_PER_TASK | Number of CPU cores for each task or MPI process |
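As a rough illustration of how these variables might be used, the sketch below prints a short job summary and derives the total core count from the task count and the cores per task, as the note on $SLURM_NTASKS suggests. total_cores is a hypothetical helper, and the fallback values are assumptions so that the snippet also runs outside a Slurm job.

```shell
#!/bin/bash
# total_cores is a hypothetical helper (not part of Slurm).
# It multiplies the task count by the cores per task, treating unset variables as 1.
total_cores() {
    echo $(( ${SLURM_NTASKS:-1} * ${SLURM_CPUS_PER_TASK:-1} ))
}

# Outside a Slurm job these variables are unset, so placeholder text is printed instead.
echo "Job ${SLURM_JOB_ID:-unknown} (${SLURM_JOB_NAME:-unnamed})"
echo "Submitted from: ${SLURM_SUBMIT_DIR:-unknown}"
echo "Nodes: ${SLURM_JOB_NODELIST:-unknown}"
echo "Total CPU cores: $(total_cores)"
```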