Job Submission in Slurm

Now that you understand the basic types of jobs, you are ready to submit a job.

When you submit a new job, the scheduler places your job in the queue until it can find available resources to assign.

Slurm Job Submission Commands

Slurm has three different job submission commands: salloc, sbatch, and srun. Users will need to understand the difference between these commands to determine which command is best suited to the task they are trying to accomplish.

The salloc command

The salloc command is used to request resource allocation for interactive jobs.

salloc [JOB-SUBMISSION-FLAGS]

By default, without additional options, salloc will request one compute node, one CPU core, and one gigabyte of memory for thirty minutes. To customize your resource request, you can pass one or more job submission flags (such as --time or --mem) to specify what you need.

Once you submit your salloc job, it will follow this life cycle:

  • If the scheduler accepts your job, salloc will pause while your job waits in the queue for resources to be allocated.
  • When the job is allocated, salloc will provide an interactive shell session where you can type commands.
  • When your shell session exits, salloc will automatically mark your job as completed and release the resources for use by other jobs.

To learn more, you can read the salloc manual page.

Example usage of salloc

Let's say that we need the following for a job:

  • one node
  • four CPU cores
  • 8 GB of memory
  • 2 hours of wall time

The command to start this example job would be:

salloc --cpus-per-task 4 --mem 8GB --time 02:00:00

Example output of salloc:

salloc: Pending job allocation 2113
salloc: job 2113 queued and waiting for resources
salloc: job 2113 has been allocated resources
salloc: Granted job allocation 2113
salloc: Nodes node0405 are ready for job

The sbatch command

The sbatch command is used to submit batch jobs that will run in the background.

sbatch [JOB-SUBMISSION-FLAGS] [SCRIPT-FILE]

Users must pass the path to a shell script (typically a .sh file) that will be invoked by sbatch when your job starts. The script must specify both the resource requirements for the job and what commands to run.

Your shell script should use special #SBATCH comments at the top of the file to specify resource requirements. You can use any of the standard job submission flags. If desired, you may override flags specified in the script file by passing them to sbatch before the script file name.

Below your #SBATCH comments, you should specify one or more commands that you would like to run during your task. We strongly recommend using srun to run commands in your batch script to make debugging easier.

Once you submit your sbatch job, it will follow this life cycle:

  • If the scheduler accepts your job, sbatch will output the Job ID for your newly created job.
  • After submission, you may disconnect from Palmetto and work on other things.
  • When the job is allocated, the shell script that you provided to sbatch will be executed.
  • When your shell script exits, the job will be marked as completed and resources will be released for use by other jobs.

Job output is saved to a file on disk by default. See the control page for instructions on how to view job output during or after execution.

To learn more, you can read the sbatch manual page.

Example usage of sbatch

First, we will need to create a shell script for our job. These can be created with any text editor (like vim or nano) and must be stored on the cluster.

For this example, we will create myjob.sh in our home directory with the following contents:

#!/bin/bash

#SBATCH --job-name my-job-name
#SBATCH --nodes 2
#SBATCH --tasks-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --mem 12gb
#SBATCH --time 02:00:00
#SBATCH --gpus-per-node a100:1

srun python3 run-my-science-workflow.py

Notice the special #SBATCH comment lines, which will determine what resources our job requires. To learn more about each of these flags, please review the job submission flags section.

To submit this job, you would run the following command:

sbatch myjob.sh

Example output of sbatch:

sbatch: Submit checks complete!
Submitted batch job 2349
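
If a later command needs the Job ID, it can be pulled out of the submission message. The sketch below is only an illustration: extract_job_id is a hypothetical helper, and it parses sample text copied from the example output above rather than a live submission.

```shell
#!/bin/bash
# Hypothetical helper: read sbatch output on stdin and print only the job ID.
extract_job_id() {
  awk '/^Submitted batch job/ { print $NF }'
}

# Sample text copied from the example output above (not a live submission).
sample_output='sbatch: Submit checks complete!
Submitted batch job 2349'

job_id=$(printf '%s\n' "$sample_output" | extract_job_id)
echo "Job ID: ${job_id}"
```

In a real session, you could pipe the output of sbatch myjob.sh into the helper instead of the sample text.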

The srun command

The srun command is typically used to launch programs as a parallel job within a batch job submitted by sbatch.

srun [JOB-SUBMISSION-FLAGS] [COMMAND-TO-RUN] [FLAGS-FOR-COMMAND-TO-RUN]

Note: srun may also be used outside of sbatch scripts to request other parallel job allocations; that usage is not covered on this page.

Using srun provides better integration with the Slurm scheduler and can help with debugging individual steps of a job. This is especially helpful when you are running more than one task (--ntasks greater than 1), possibly even across multiple nodes (--nodes greater than 1).

When used inside of an sbatch job script, commands passed to srun are run on all nodes/tasks in the job by default. You can alter this behavior using the --ntasks and --nodes flags.

If you would like to further subdivide the resources requested by your job, you can specify additional job submission flags.

To learn more, you can read the srun manual page.

Basic example usage of srun within an sbatch job with multiple nodes

Let's say we have submitted a batch job using sbatch myjob.sh and the myjob.sh file has the following contents:

#!/bin/bash

#SBATCH --ntasks 3
#SBATCH --nodes 3

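# Start both job steps in the background so that they run at the same time.
srun --nodes 1 --ntasks 1 python3 server.py &
srun --nodes 2 --ntasks 2 python3 worker.py &

# Keep the batch script alive until both job steps have finished.
wait

This batch job requests 3 nodes, then uses srun to execute a server program on one node and a worker program on the other two. Each srun line ends with & so that the two job steps run concurrently rather than one after the other, and wait prevents the script from exiting before they finish.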

Slurm Job Submission Flags

When submitting a job, you will need to use a combination of flags to specify which resources your job needs.

You can specify just the flags you need. For example, if you are not using a GPU, you do not need to specify --gpus-per-task 0.

Recommended Flags

At a minimum, we recommend that you specify wall time, CPU count, and memory.

Wall Time

Wall time is the maximum amount of time your job should take to complete. This amount is measured in real time (like a wall clock), which is different from CPU time.

The timer begins when your job starts executing, so the time your job spends waiting in the queue is not included in your wall time limit. Once the timer expires, your job will be terminated if it is still running.

Use flag --time (or lowercase -t for short), and format your request as hh:mm:ss.

For example, to ask for 80 hours, 39 minutes, and 20 seconds:

--time 80:39:20
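
If you track run times in seconds, a small conversion can produce the hh:mm:ss string. This is a minimal sketch; seconds_to_walltime is a hypothetical helper name and not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: convert a number of seconds into hh:mm:ss for --time.
seconds_to_walltime() {
  total=$1
  hours=$(( total / 3600 ))
  minutes=$(( (total % 3600) / 60 ))
  seconds=$(( total % 60 ))
  printf '%02d:%02d:%02d\n' "$hours" "$minutes" "$seconds"
}

# 80 hours, 39 minutes, and 20 seconds expressed in seconds.
requested=$(( 80 * 3600 + 39 * 60 + 20 ))
echo "--time $(seconds_to_walltime "$requested")"
```

The last line prints the same flag shown in the example above.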

Wall Time Limits

To help us better allocate the limited pool of shared resources, there is a maximum amount of wall time that users may request for a job. The limit varies based on job source and the destination partition you submit to.

| Job Source | General Partitions | Owner Partitions |
| --- | --- | --- |
| Batch Job Submission (via sbatch) | 72 hours (3 days) | 336 hours (14 days) |
| Interactive Job Submission (via salloc) | 12 hours | 336 hours (14 days) |
| OnDemand Interactive App Session | 12 hours | 336 hours (14 days) |
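
The limits in the table can be checked before submitting. The sketch below hard-codes the values from the table; max_walltime_hours and its argument names (sbatch, salloc, ondemand, general, owner) are hypothetical labels chosen for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the wall time limit in hours from the table above.
# $1 is the job source (sbatch, salloc, or ondemand); $2 is general or owner.
max_walltime_hours() {
  if [ "$2" = "owner" ]; then
    echo 336
    return
  fi
  case "$1" in
    sbatch) echo 72 ;;
    salloc|ondemand) echo 12 ;;
    *) echo "unknown job source: $1" >&2; return 1 ;;
  esac
}

requested_hours=48
limit=$(max_walltime_hours sbatch general)
if [ "$requested_hours" -le "$limit" ]; then
  echo "A ${requested_hours}-hour batch job fits within the ${limit}-hour limit."
else
  echo "A ${requested_hours}-hour batch job exceeds the ${limit}-hour limit."
fi
```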

CPU Core Count

By default, jobs will only receive one CPU core per task.

Use flag --cpus-per-task (or lowercase -c for short).

For example, to ask for 7 CPU cores:

--cpus-per-task 7

Memory

There are many ways to specify the amount of memory your job requires in Slurm.

The best route will depend on how many tasks your job requests.

The recommendation below applies only to single-task jobs (where --ntasks is not specified or is set to 1).

Use flag --mem and provide the amount of memory you need.

For example, to ask for 2.5 gigabytes:

--mem 2.5gb

Caution: The default unit is megabytes (mb), so be sure to specify gb if you intend to ask for gigabytes of memory.
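
To see how much a missing unit changes a request, the sketch below converts a --mem value to megabytes. mem_to_mb is a hypothetical helper; it assumes 1 GB equals 1024 MB and only understands bare numbers and the m/mb and g/gb suffixes.

```shell
#!/bin/bash
# Hypothetical helper: show how many megabytes a --mem value represents,
# assuming 1 GB = 1024 MB and that a bare number means megabytes.
mem_to_mb() {
  request=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$request" in
    *g|*gb) awk -v n="${request%%g*}" 'BEGIN { printf "%d\n", n * 1024 }' ;;
    *m|*mb) echo "${request%%m*}" ;;
    *)      echo "$request" ;;
  esac
}

echo "--mem 8   -> $(mem_to_mb 8) MB"
echo "--mem 8gb -> $(mem_to_mb 8gb) MB"
```

Under these assumptions, leaving off the unit turns an intended 8 GB request into an 8 MB request.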

Job Name

Users can give their jobs a unique name to make it easier to find them in reports.

Use flag --job-name (or uppercase -J for short) and provide the desired name. Note that these names should not include spaces.

For example, to use name dice_roll_simulation:

--job-name dice_roll_simulation
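
If your working title contains spaces, one option is to replace them before passing the name to Slurm. This is a small sketch; to_job_name is a hypothetical helper, and underscores are just one possible replacement character.

```shell
#!/bin/bash
# Hypothetical helper: replace spaces in a descriptive title with underscores.
to_job_name() {
  printf '%s\n' "$1" | tr ' ' '_'
}

echo "--job-name $(to_job_name "dice roll simulation")"
```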

Email Notifications

Users can request email notifications on events. The flag to enable this is:

--mail-type <type>

Type can be one or more of the following:

  • NONE: no emails (default)
  • BEGIN: job began
  • END: job ended
  • FAIL: job failed
  • REQUEUE: job requeued
  • ALL: the same as BEGIN,END,FAIL,INVALID_DEPEND,REQUEUE,STAGE_OUT
  • INVALID_DEPEND: dependency never satisfied
  • TIME_LIMIT: reached time limit
  • TIME_LIMIT_90: reached 90% of time limit
  • TIME_LIMIT_80: reached 80% of time limit
  • TIME_LIMIT_50: reached 50% of time limit
  • ARRAY_TASKS: send email for each array task
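
Several types can be combined in one comma-separated value. The sketch below checks such a value against the list above before it is used; check_mail_types is a hypothetical helper, and the scheduler performs its own validation regardless.

```shell
#!/bin/bash
# Hypothetical helper: check a comma-separated --mail-type value against the
# types listed above. Prints the flag and returns 0 if every entry is known.
check_mail_types() {
  known="NONE BEGIN END FAIL REQUEUE ALL INVALID_DEPEND TIME_LIMIT TIME_LIMIT_90 TIME_LIMIT_80 TIME_LIMIT_50 ARRAY_TASKS"
  for entry in $(printf '%s' "$1" | tr ',' ' '); do
    case " $known " in
      *" $entry "*) ;;
      *) echo "unknown mail type: $entry" >&2; return 1 ;;
    esac
  done
  echo "--mail-type $1"
}

check_mail_types "END,FAIL,TIME_LIMIT_90"
```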

GPU Count

By default, no GPU resources are assigned to jobs. Users can ask for GPU resources by specifying the number of GPUs they need and (optionally) the desired model of GPU.

Use flag --gpus or lowercase -G for short, and pass either:

  • the number of GPUs you want, of any type
  • the type of GPU you want and the number of GPUs you want, separated by a colon (:)

For example, if you want two GPUs of any type:

--gpus 2

Or, if you want a single NVIDIA A100 GPU:

--gpus a100:1

What GPU models are available?

Palmetto has a wide variety of GPUs available.

Here is a list of currently available GPU types and their codes:

  • NVIDIA V100: v100
  • NVIDIA A100: a100
  • NVIDIA H100: h100
  • NVIDIA L40S: l40s

Users are encouraged to review the hardware table to see what GPU models are available on which nodes.

What if I need a specific VRAM variant of a particular GPU model?

While several nodes in the cluster may have the same model of GPU, they may have different amounts of VRAM.

For example, we have some NVIDIA A100s with 80 GB of VRAM and some NVIDIA A100s with 40 GB of VRAM.

See the GPU VRAM Variant feature constraint for more information.

What if I am running multiple tasks?

If you are running multiple tasks (where ntasks > 1), then you may want more control over how your allocated GPU resources are assigned.

You can use --gpus-per-task to tell the scheduler how many GPUs to assign to each task. The argument format is the same as for --gpus.

Important: you must still specify --gpus to tell the scheduler how many GPUs you want total for the job.

For example, if I am running two tasks, want two NVIDIA A100 GPUs, and one GPU for each task:

--ntasks 2 --gpus a100:2 --gpus-per-task a100:1
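
Because the job-wide total must match the per-task request, it can help to derive one from the other. The sketch below uses illustrative shell variables (ntasks, gpus_per_task, gpu_type) and only prints the resulting flags.

```shell
#!/bin/bash
# Values from the example above; change them to match your own job.
ntasks=2
gpus_per_task=1
gpu_type="a100"

# The job-wide --gpus count is the per-task count multiplied by the task count.
total_gpus=$(( ntasks * gpus_per_task ))
gpu_flags="--ntasks ${ntasks} --gpus ${gpu_type}:${total_gpus} --gpus-per-task ${gpu_type}:${gpus_per_task}"
echo "$gpu_flags"
```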

Partition Target

Users can specify which partition they would like to submit their job to. Partitions are analogous to queues.

If you have access to purchased nodes on the cluster, submitting to the associated owner partition will grant your job priority over general partition jobs.

Use flag --partition (or lowercase -p for short) and pass the name of the partition.

How can I find out what partitions I have access to?

You can run the command below to see a list of partitions you have access to:

sinfo --Format PartitionName

Unless you have access to an owner's priority partition, you will likely only see work1.

Output File Location

Feature Constraints

Advanced users of Palmetto may need to request a very specific type of hardware. Feature constraints allow you to narrow down what resources your job is eligible to run on.

Use flag --constraint or uppercase -C for short, then pass a comma-delimited list of features that you require.

For example, say you need a node with interconnect type FDR and an Intel CPU:

--constraint interconnect_fdr,chip_manufacturer_intel

Caution: While many options in Slurm are common across compute clusters, there is no standard convention for feature values. If you are copying code from elsewhere that uses feature constraints, make sure you are using the correct feature values from below for Palmetto.
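
When the required features are kept as separate words in a script, they can be joined into the comma-delimited form described above. This is a minimal sketch; join_features is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: join feature values with commas for --constraint.
# The body runs in a subshell so the IFS change does not leak to the caller.
join_features() (
  IFS=,
  printf '%s\n' "$*"
)

echo "--constraint $(join_features interconnect_fdr chip_manufacturer_intel)"
```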

Interconnects

To specify which interconnect you want, the following features are available:

| Interconnect Type | Feature Value |
| --- | --- |
| 1 Gigabit Ethernet | interconnect_1ge |
| 10 Gigabit Ethernet | interconnect_10ge |
| 25 Gigabit Ethernet | interconnect_25ge |
| 100 Gigabit Ethernet | interconnect_100ge |
| HDR InfiniBand (100 Gbps) | interconnect_hdr or interconnect_100g |
| FDR InfiniBand (56 Gbps) | interconnect_fdr or interconnect_56g |

CPU Manufacturer

To specify which CPU manufacturer you want, the following features are available:

| CPU Manufacturer | Feature Value |
| --- | --- |
| Intel | chip_manufacturer_intel |
| AMD | chip_manufacturer_amd |

CPU Generation

| CPU Generation | Feature Value |
| --- | --- |
| AMD Genoa | cpu_gen_genoa |
| AMD Milan | cpu_gen_milan |
| AMD Rome | cpu_gen_rome |
| Intel Broadwell | cpu_gen_broadwell |
| Intel Cascade Lake | cpu_gen_cascadelake |
| Intel Haswell | cpu_gen_haswell |
| Intel Ice Lake | cpu_gen_icelake |
| Intel Sapphire Rapids | cpu_gen_sapphirerapids |
| Intel Sky Lake | cpu_gen_skylake |

CPU Chip Model

If you need a specific model of CPU, then you can specify a chip_type feature.

| CPU Manufacturer | CPU Generation | CPU Model | Core Count | Feature Value |
| --- | --- | --- | --- | --- |
| AMD | Genoa | EPYC 9654 | 96 | chip_type_9654 |
| AMD | Milan | EPYC 7543 | 32 | chip_type_7543 |
| AMD | Milan | EPYC 7713P | 64 | chip_type_7713p |
| AMD | Rome | EPYC 7742 | 64 | chip_type_7742 |
| Intel | Broadwell | Xeon E5-2640 v4 | 10 | chip_type_e5-2640v4 |
| Intel | Broadwell | Xeon E5-2680 v4 | 8 | chip_type_e5-2680v4 |
| Intel | Broadwell | Xeon E5-4627 v4 | 10 | chip_type_e5-4627v4 |
| Intel | Cascade Lake | Xeon Gold 6230R | 26 | chip_type_6230r |
| Intel | Cascade Lake | Xeon Gold 6238R | 28 | chip_type_6238r |
| Intel | Cascade Lake | Xeon Gold 6240 | 18 | chip_type_6240 |
| Intel | Cascade Lake | Xeon Gold 6248 | 20 | chip_type_6248g |
| Intel | Cascade Lake | Xeon Gold 6252 | 24 | chip_type_6252g |
| Intel | Cascade Lake | Xeon Gold 6258R | 28 | chip_type_6258r |
| Intel | Haswell | Xeon E5-2680 v3 | 12 | chip_type_e5-2680v3 |
| Intel | Haswell | Xeon E5-2698 | 20 | chip_type_e5-2698 |
| Intel | Ice Lake | Xeon Gold 6348 | 28 | chip_type_6348g |
| Intel | Ice Lake | Xeon Platinum 8358 | 32 | chip_type_8358 |
| Intel | Ice Lake | Xeon Platinum 8360Y | 36 | chip_type_8360y |
| Intel | Ice Lake | Xeon Platinum 8362 | 32 | chip_type_8362 |
| Intel | Ivy Bridge | Xeon E5-2650 v2 | 8 | chip_type_e5-2650v2 |
| Intel | Ivy Bridge | Xeon E5-2670 v2 | 10 | chip_type_e5-2670v2 |
| Intel | Ivy Bridge | Xeon E5-3670 v2 | 10 | chip_type_e5-3670v2 |
| Intel | Sandy Bridge | Xeon E5-4640 | 8 | chip_type_e5-4640 |
| Intel | Sandy Bridge | Xeon E5-2665 | 8 | chip_type_e5-2665 |
| Intel | Sapphire Rapids | Xeon Platinum 8462Y+ | 32 | chip_type_8462Y+ |
| Intel | Sapphire Rapids | Xeon Platinum 8470 | 52 | chip_type_8470 |
| Intel | Sapphire Rapids | Xeon Platinum 8480CL | 56 | chip_type_8480cl |
| Intel | Sky Lake | Xeon Gold 6148 | 20 | chip_type_6148g |

X11/Graphical Software Capability

To specify that your job requires a node capable of running software with a graphical user interface (GUI), you can use the graphics feature.

Advanced Vector Extensions

To specify that your software requires a CPU capable of executing Advanced Vector Extensions (AVX) instructions, use the following features:

| Capability | Feature Value |
| --- | --- |
| AVX | extension_avx |
| AVX2 (256-bit) | extension_avx2 |
| AVX-512 (512-bit) | extension_avx512 |

GPU VRAM Variant

To specify which VRAM variant of a particular GPU you want, the following features are available:

| GPU Model | VRAM Amount | Feature Value |
| --- | --- | --- |
| NVIDIA A100 | 40 GB | gpu_a100_40gb |
| NVIDIA A100 | 80 GB | gpu_a100_80gb |

Note: Specifying the feature constraint does not request a GPU. For example, if you specify --constraint gpu_a100_80gb but do not also specify --gpus a100:1, then you will not receive a GPU.

The feature constraint simply filters out which VRAM variant of a particular GPU your job is eligible to run on.
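
The pairing of the two flags can be checked mechanically. The sketch below is only an illustration: check_gpu_request is a hypothetical helper that scans a space-separated flag list, and it recognizes only the long --gpus and --constraint spellings used on this page.

```shell
#!/bin/bash
# Hypothetical helper: warn when a flag list names a gpu_* feature constraint
# but contains no --gpus request. Only space-separated long flags are matched.
check_gpu_request() {
  case " $* " in
    *" --constraint gpu_"*)
      case " $* " in
        *" --gpus "*) echo "ok: GPU requested with a VRAM constraint" ;;
        *) echo "warning: constraint set, but no GPU requested"; return 1 ;;
      esac ;;
    *) echo "ok: no GPU feature constraint in use" ;;
  esac
}

check_gpu_request --gpus a100:1 --constraint gpu_a100_80gb
```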

GPU Interconnects

| GPU Interconnect Type | Feature Value |
| --- | --- |
| NVLink | gpu_interconnect_nvlink |
| PCI Express (PCIe) | gpu_interconnect_pcie |

Note: If you do not specify a constraint, the default is PCI Express (PCIe).

Task and Node Count

Advanced users may need access to multiple chunks of compute resources to complete a job. This is particularly useful for MPI.

Use flag --ntasks or lowercase -n for short. Pass the number of tasks you want to run as an argument.

For example, if I want to run four tasks (MPI processes):

--ntasks 4

Note: Multiple tasks can land on the same or different compute nodes, depending on the available resources.

To specify how many nodes you want, you can use the --nodes flag along with --ntasks.

For example, to run four tasks across two different compute nodes:

--nodes 2 --ntasks 4

Note: The four tasks (MPI processes) may not be distributed evenly across the two nodes.

To allocate the four tasks (MPI processes) evenly on the two nodes, you can use flag --nodes together with --tasks-per-node.

For example, to run four tasks across two different compute nodes, with two tasks on each node:

--nodes 2 --tasks-per-node 2

For the same job, with the four tasks (MPI processes) allocated evenly across the two nodes, you can give each task 6 CPU cores (for example, to run 6 threads per task) with the --cpus-per-task flag:

--nodes 2 --tasks-per-node 2 --cpus-per-task 6
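
The totals implied by these three flags follow from simple multiplication. The sketch below works through the arithmetic for the example above; the variable names are illustrative only.

```shell
#!/bin/bash
# Values from the example flags above.
nodes=2
tasks_per_node=2
cpus_per_task=6

# Total tasks, cores used on each node, and cores used by the whole job.
total_tasks=$(( nodes * tasks_per_node ))
cores_per_node=$(( tasks_per_node * cpus_per_task ))
total_cores=$(( total_tasks * cpus_per_task ))
echo "tasks: ${total_tasks}, cores per node: ${cores_per_node}, total cores: ${total_cores}"
```

For this request, that works out to 4 tasks, 12 cores on each node, and 24 cores in total.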

This usually works together with the OMP_NUM_THREADS environment variable, which should be set to the same value as --cpus-per-task:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

$SLURM_CPUS_PER_TASK is an environment variable that Slurm sets inside the job. More Slurm environment variables are listed in the section on available environment variables below.
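
Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task was given, so a script that may also run without that flag, or outside a job, can supply a fallback. The sketch below assumes a fallback of 1 thread is acceptable; thread_count is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: use the Slurm value when present, otherwise 1 thread
# (for example, outside a job or when --cpus-per-task was not requested).
thread_count() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

OMP_NUM_THREADS=$(thread_count)
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```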

OpenMPI

On the Slurm cluster, the prebuilt OpenMPI installations (versions 4.1.5 and 4.1.6) were compiled with PMI2 support enabled. Therefore, if your application is compiled against the prebuilt OpenMPI, it can be launched directly with the srun command. Resources such as the number of nodes, the number of MPI processes, the number of threads for each MPI process, and memory are determined automatically from the values you pass to --nodes, --ntasks-per-node, --cpus-per-task, --mem, and related flags.

Intel MPI

On the Slurm cluster, srun is also the recommended launcher for applications built against Intel MPI. This method is the best integrated with Slurm and supports process tracking, accounting, task affinity, suspend/resume, and other features. However, Intel MPI does not link directly to any external PMI implementation, so you need to point it to the PMI2 library manually. This can be done with the command:

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

You can then launch the application with the srun command. Resources such as the number of nodes, the number of MPI processes, the number of threads for each MPI process, and memory are determined automatically from the values you pass to --nodes, --ntasks-per-node, --cpus-per-task, --mem, and related flags.

srun ./your_app_name
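
Before launching, a job script can confirm that the PMI2 library exists at the expected path. This is a cautious sketch that reuses the path from the export command above; the variable name pmi_library is illustrative.

```shell
#!/bin/bash
# Path taken from the export command above; it may differ on other clusters.
pmi_library="/usr/lib64/libpmi2.so"

if [ -e "$pmi_library" ]; then
  export I_MPI_PMI_LIBRARY="$pmi_library"
  echo "Using PMI2 library: ${I_MPI_PMI_LIBRARY}"
else
  echo "PMI2 library not found at ${pmi_library}" >&2
fi
```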

Available Environment Variables in a Slurm Job

The table below shows common Slurm environment variables. For a full list of Slurm environment variables, please see the official documentation.

| Variable Name | Description |
| --- | --- |
| $SLURM_JOB_ID | Job ID |
| $SLURM_JOB_NAME | Job name |
| $SLURM_SUBMIT_DIR | Submitting directory |
| $SLURM_JOB_NODELIST or srun hostname | Nodes allocated to the job |
| $SLURM_NTASKS | Total number of tasks or MPI processes (NOTE: this is not the total number of cores if --cpus-per-task is not 1) |
| $SLURM_CPUS_PER_TASK | Number of CPU cores for each task or MPI process |
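
Printing these variables at the top of a job script makes the job's output file easier to interpret later. The sketch below is one possible layout; print_job_summary is a hypothetical helper, and it prints "unset" for any variable that is missing, for example when run outside a Slurm job.

```shell
#!/bin/bash
# Hypothetical helper: print the variables from the table above, with a
# placeholder for any that are unset (for example, outside a Slurm job).
print_job_summary() {
  echo "Job ID: ${SLURM_JOB_ID:-unset}"
  echo "Job name: ${SLURM_JOB_NAME:-unset}"
  echo "Submit dir: ${SLURM_SUBMIT_DIR:-unset}"
  echo "Node list: ${SLURM_JOB_NODELIST:-unset}"
  echo "Tasks: ${SLURM_NTASKS:-unset}"
  echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-unset}"
}

print_job_summary
```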