About the Job Scheduler

With over a thousand compute nodes, the Palmetto 2 cluster has a vast amount of resources available. However, to make efficient use of these resources, we need a way to divide and share them amongst all of our users.

This is the job scheduler's principal task – to keep track of which resources are free/busy and distribute access to our users in a fair manner.

Introducing Slurm

The Palmetto 2 cluster uses the Slurm Workload Manager to manage shared resources and schedule jobs. Slurm is widely used on other HPC clusters and provides modern features that make it easier for both users and administrators to work with.

Limits

To keep the cluster stable and ensure available resources for all users, the following limits apply.

Keep in mind that these limits apply only to the job scheduler. Other limits also apply, including (but not limited to) the Acceptable Use Guidelines and Storage Quotas.

Job Submission Limit

Users may have no more than 3000 jobs in the pending or running state (combined) at one time.

See the Lifecycle of a Slurm Job to learn more about job states.
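
If you are unsure how close you are to this limit, you can count your own pending and running jobs with squeue. A minimal sketch, assuming a standard squeue installation:

  # Count your jobs that are currently pending or running
  # (the combined total must stay at or below 3000).
  squeue -u $USER -h -t pending,running | wc -l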

Slurm Request Rate Limit

Slurm uses the concept of "buckets" to limit how many commands a user can send to the controller. Each user has a bucket size of 1000, and each request subtracts from their bucket. The bucket is intentionally configured to refill slowly (2 requests per second) to accommodate workloads that submit jobs in bursts while still encouraging slower overall behavior.

If your bucket is empty when you make a request, your command will be blocked until your bucket refills. Depending on the frequency of your requests, this may result in some commands failing or taking longer than expected. If you expect your workload to submit requests in bursts, it is recommended to account for and handle this possible error in your scripts.

warning

Running Slurm commands in a loop will not only exhaust your bucket, but will also put extra load on the controller that may interfere with its performance. Do not put Slurm commands in a loop without a significant delay (60 seconds or more) between iterations.

To better monitor your job's status, please use Email Notifications.
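
For example, instead of polling squeue in a loop, you can ask Slurm to email you when a job starts, ends, or fails. A minimal sketch of a batch script; the job name, wall time, email address, and program are placeholders:

  #!/bin/bash
  #SBATCH --job-name=example           # placeholder job name
  #SBATCH --time=01:00:00              # placeholder wall time
  #SBATCH --mail-type=BEGIN,END,FAIL   # email on job start, completion, and failure
  #SBATCH --mail-user=you@example.edu  # placeholder; use your own address

  srun ./my_program                    # placeholder workload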

Queue

Slurm has a single queue, ordered by priority. We use Slurm's partition system to divide this single queue into distinct parts.

Priority

A job's priority is calculated from several parameters configured in Slurm; the most important are fair share, time in queue, partition, and job size.

When a job gets submitted, it is given an initial priority based on the factors listed above, usually putting the job towards the end of the queue, though this initial position can vary from job to job.

As a job sits in the queue, it should gain priority over time as the job's "age" increases. In some circumstances, however, a job's priority may stay the same or decrease. This is usually due to how fair share works: if a user has one or more jobs running, their fair share may go down compared to other users in the queue.

A job's priority can be checked with the command sprio.
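
For example, to see the priority breakdown (age, fair share, partition, job size, and so on) for your own pending jobs:

  # List priority factors for your pending jobs in long format.
  sprio -u $USER -l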

info

Priority only applies when there is contention for resources, such as when the cluster is very busy.

If there are sufficient available resources to run your job, a low priority value will not prevent your job from running.

Partition Order

There is a hierarchy of partitions on Palmetto 2. The following is that hierarchy and some information about each type:

  • Owner Partitions
    • Jobs in these partitions can preempt other jobs.
    • Jobs submitted here will be put at the top of the queue.
    • Resources in these partitions are limited based on the number of nodes purchased.
    • This tier includes the MERCURY Consortium's Skylight.
  • interact partition
    • Any srun, salloc, or Open OnDemand job is automatically routed here, since Slurm treats these as interactive jobs.
    • Jobs in this partition will not preempt other jobs.
    • Jobs submitted here get a slightly increased priority, since the user is expected to be actively interacting with their job.
    • Interactive jobs have a maximum wall time of 12 hours; requests that exceed this are automatically lowered to 12 hours.
  • work1 partition
    • work1 is the default partition for batch jobs; any sbatch job is automatically routed here (see the example script after this list).
    • Jobs in this partition will not preempt other jobs.
    • Jobs submitted here do not get increased priority.
  • osg (Open Science Grid) partition
    • Jobs submitted to this partition will not preempt other jobs.
    • Jobs submitted here do not get increased priority.
    • This partition is limited to a smaller subset of Palmetto 2.
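
Because work1 is the default for batch jobs and interactive jobs are routed to interact automatically, you normally do not need to request a partition explicitly. A minimal sketch of a batch script that states the partition anyway; the resource values and program are placeholders:

  #!/bin/bash
  #SBATCH --partition=work1    # default for batch jobs; owners would name their owner partition instead
  #SBATCH --nodes=1            # placeholder resource request
  #SBATCH --cpus-per-task=4
  #SBATCH --mem=8G
  #SBATCH --time=02:00:00

  srun ./my_program            # placeholder workload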

Fair Share

Fair share is calculated based on the user's resource usage compared to other users on the cluster.

Users with higher-than-average utilization will have lower fair share weight values, which reduces the priority of their jobs. Conversely, users with lower-than-average utilization will have higher fair share weight values, which increases the priority of their jobs.

A user's fair share can be checked with the sshare command. A fair share value of more than 0.5 means a user is using less than their share, and a value of less than 0.5 means a user is using more than their share.
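
For example, to see your own fair share value (the exact columns shown depend on the cluster's accounting configuration):

  # The FairShare column holds the value discussed above:
  # above 0.5 = using less than your share, below 0.5 = using more.
  sshare -u $USER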

tip

If your fair share value is low, your jobs may sit in the queue for longer, but will still run eventually.

Two factors ensure that jobs will not get stuck in queue indefinitely, even if their fair share weight is low:

  • a job's priority increases the longer it sits in the queue, as explained in the priority section
  • over time, as your existing jobs end and your future jobs sit in queue, your overall utilization will decrease and your fair share will increase

For more information on how priority is calculated, see the Slurm documentation on fair share.

Scheduling Process

Slurm has two scheduling loops, the main scheduling loop and the backfill scheduling loop. Both loops look at jobs in the order of their priority, but schedule slightly differently.

Main Scheduling Loop

The main scheduling loop schedules jobs based on their priority value and makes node and start time estimates based on the wall time of currently running jobs.

Summary of main scheduler process:

  • The main scheduling loop runs often, about once per minute.
  • Running jobs are evaluated first to determine which resources are available or busy.
  • Pending jobs are evaluated in priority order.
    • If there are enough resources available, the main scheduler will schedule the job.
    • If there are not enough resources remaining due to higher-priority jobs, the job is skipped.
  • The main scheduling loop continues until either:
    • All jobs are evaluated
    • The main scheduler job count limit is reached
    • The scheduling loop time limit is reached
note

Note that lower-priority jobs will only be skipped by the main scheduling loop when there are higher-priority jobs with similar or overlapping resource requests. These jobs could still be scheduled during the backfill scheduling loop.

For example, the priority of a job requesting A100 GPUs will not affect the scheduling of a job requesting V100 GPUs.
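
You can also ask Slurm for the start-time estimates it has computed for your pending jobs. These estimates are based on the wall times of currently running jobs, so they can change as jobs finish early or higher-priority work arrives:

  # Show estimated start times for your pending jobs.
  squeue -u $USER --start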

Example of Main Scheduling

Let's look at a simplified example of how main scheduling works.

For the example, we will assume a homogeneous cluster where all nodes have 1 CPU core and 1 GB of memory, and assume that all jobs ask for full nodes. The real main scheduler will consider other resource dimensions, including wall time, memory, GPUs, and more.

Our simple example cluster has four nodes total and three nodes are busy. This means that only one node is available.

Three jobs are currently pending in the queue:

  • Job #1 requests two nodes and has a priority of 5000.
  • Job #2 requests one node and has a priority of 3000.
  • Job #3 requests one node and has a priority of 2000.

When the main scheduler runs, Job #1 cannot start because there are not enough free resources, so it is not scheduled. Jobs #2 and #3 also cannot be scheduled, because a higher-priority job (Job #1) is waiting for the same resources.

Remember that the backfill scheduling loop could schedule the lower priority jobs to run during its process, so they will have another opportunity for evaluation.

Backfill Scheduling Loop

The second scheduling process is the backfill scheduler, a secondary scheduling loop designed to optimize the utilization of computing resources by allowing shorter jobs to run ahead of longer, higher-priority jobs, as long as they do not delay the start of those higher-priority jobs.

Summary of backfill scheduler process:

  • Job Prioritization: It considers all running jobs and pending jobs in priority order.
  • Resource Allocation: It looks for available resources that can be used without delaying higher-priority jobs.
  • Efficiency: By filling in gaps with shorter jobs, it maximizes resource usage and reduces idle times.
note

There is a maximum number of jobs the backfill scheduler will look at in total, per partition, and per user to help prevent queue stuffing and allow the largest number of users a chance for their job to run.

Example of Backfill Scheduling

Let's look at a simplified example of how backfill scheduling works. For the example, we will only consider CPU cores and wall time. The real backfill scheduler will also consider other trackable resources, such as memory and GPUs.

Our example cluster has 2 nodes, each with 2 CPU cores. Let's say the following jobs have been submitted to the cluster:

  • Job 1: submitted at 9 AM; currently running; asks for 1 node, 2 CPU cores for 1 hour
  • Job 2: submitted at 9:30 AM; currently running; asks for 1 node, 1 CPU core for 2 hours
  • Job 3: submitted at 9:45 AM; currently running; asks for 1 node for 1 hour
  • Job 4: submitted at 9:55 AM; currently running; asks for 1 node for 1 hour
  • Job 5: submitted at 10:10 AM; queued; resources planned; asks for 1 node, 2 CPU cores for 1.5 hours
  • Job 6: submitted at 10:12 AM; queued; resources planned; asks for 1 node, 1 CPU core for 75 minutes
  • Job 7: submitted at 10:13 AM; queued; resources planned; asks for 1 node, 2 CPU cores for 1 hour
  • Job 8: submitted at 10:20 AM; currently running; backfilled; asks for 1 node, 1 CPU core for 0.5 hours

Even though Job 8 was submitted after the rest of the jobs, Slurm was still able to backfill it into an idle time slot, since the job could fill those idle resources and finish before Job 5 is planned to start.
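
The practical takeaway is that a wall time request close to what your job actually needs makes it a better backfill candidate, just like Job 8 above. A sketch of such a request; the 30-minute wall time and program are placeholders:

  #!/bin/bash
  #SBATCH --time=00:30:00   # request a wall time close to what the job actually needs
  #SBATCH --nodes=1
  #SBATCH --cpus-per-task=1

  srun ./short_task         # placeholder workload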

Preemption

Since Palmetto 2 uses the condo model, some Clemson faculty (and other institutions) have purchased nodes to guarantee priority and availability on Palmetto 2. Any job submitted to these owner partitions can preempt any non-owner job on the cluster. If an interactive or Open OnDemand job gets preempted, it will be canceled, since interactive jobs cannot be restarted. If a batch job gets preempted, it will be requeued.

note

By default, requeue is enabled for batch jobs on Palmetto 2. If your workflow does not support re-queueing after preemption, add the --no-requeue flag to your batch script.
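
A minimal sketch of how that flag fits into a batch script; the other directives and the program are placeholders:

  #!/bin/bash
  #SBATCH --no-requeue     # cancel this job instead of requeueing it if it is preempted
  #SBATCH --time=04:00:00  # placeholder wall time

  srun ./my_program        # placeholder workload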

Slurm will always attempt to start an owner job without preempting. However, if preemption is unavoidable, Slurm will attempt to preempt as few jobs as possible, preferring lower-priority jobs.

In rare cases, the backfill scheduler may need to preempt jobs. When this happens, for efficiency, the backfill scheduler will preempt an entire node whether or not the owner job needs all of those resources.