
Check Availability of Slurm Nodes

Slurm cluster is in a closed beta

The Slurm cluster is currently in a closed beta until summer 2024. If you would like to join the beta program, submit a request.

Using sinfo

The standard Slurm sinfo command can be used to check the current cluster status and node availability.

Partition Summary

To generate a row per partition with summary availability information, use the following command.

sinfo -s

Example output:

PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
work1        up   infinite        3/18/0/21 node[0400-0401,0403-0419,0421,1036]

The NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
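
If you only care about a single partition, you can restrict the summary with the -p flag. The partition name below is simply the one from the example above; substitute the partition you are interested in.

sinfo -s -p work1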

Per Node Configuration Summary

To generate a row for each node configuration (CPU count, memory, and GPU count) with summary availability information, use the following command.

sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"

Example output:

CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
64      256831   gpu:a100:2    4/16/0/20       node[0400-0401,0403-0419,0421]
128     1031142  gpu:mi210:1   0/1/0/1         node1036

The CPUS column gives the total core count, the MEMORY column gives the amount of memory in megabytes, the GRES column lists any GPUs (generic resources), and the NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
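
If it is also helpful to see which partition each node configuration belongs to, you can add a Partition field to the format string. This is just a variation on the command above; the field widths are arbitrary.

sinfo -eO "Partition:12,CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"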

Checking Nodes Partially In Use

Although a node may be listed as allocated, it may still have some resources free. Such a node is in a "mixed" state. We can check nodes in this state and see how many CPUs and GPUs are available with the following command.

sinfo -NO "CPUs:8,CPUsState:16,Memory:9,AllocMem:10,Gres:14,GresUsed:24,NodeList:50" -t mixed

Example output:

CPUS    CPUS(A/I/O/T)   MEMORY   ALLOCMEM  GRES          GRES_USED               NODELIST
64      20/44/0/64      256831   4096      gpu:a100:2    gpu:a100:0(IDX:N/A)     node0403
64      45/19/0/64      256831   4096      gpu:a100:2    gpu:a100:2(IDX:0-1)     node0404
64      61/3/0/64       256831   8192      gpu:a100:2    gpu:a100:1(IDX:0)       node0405

This output provides the number of cores by state in the format "allocated/idle/other/total" in the CPUS(A/I/O/T) column. It also shows the memory already in use (ALLOCMEM) and the GPUs in use (GRES_USED).

In this example, the first node has 44 cores, 252GB memory (256GB - 4GB), and 2 GPUs free. The second has 19 cores, 252GB memory, and no GPUs free. The third has 3 cores, 248GB memory, and 1 GPU free.

This command only lists nodes in a mixed state; it will not show nodes that are completely idle or fully allocated.
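
If you just want the idle core count for each mixed node, one possible approach is to drop the header with --noheader and split the CPUS(A/I/O/T) field with awk. This is only a sketch; adjust the fields and widths to suit your needs.

sinfo -NO "NodeList:20,CPUsState:16" -t mixed --noheader | awk '{split($2, c, "/"); print $1, c[2], "idle CPUs"}'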

Estimating Job Start Time

When a job is queued, you can see its estimated start time by passing the --start flag to squeue. For example:

[dndawso@slogin001 ~]$ squeue --start --job 3103
JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES NODELIST(REASON)
 3103     work1 interact  dndawso PD 2023-12-05T23:07:12    19     (null) (Resources)

Here I see that job ID 3103 is currently pending (state is PD) and the reason is that it is waiting for resources. The scheduler predicts that the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
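
To see estimated start times for all of your own pending jobs rather than a single job ID, you can combine --start with user and state filters. This is a sketch; it assumes $USER holds your cluster username.

squeue --start --user $USER --states PD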

You can also estimate how long it would take a job to start, without actually submitting it, by passing --test-only to srun or sbatch. For example:

[dndawso@slogin001 jobperf-batch-test]$ sbatch --test-only run2.sh
sbatch: Job 3104 to start at 2023-12-05T23:07:12 using 19 processors on nodes node[0401,0403-0419,0421] in partition work1

In this example, the job was not actually submitted and queued, but the scheduler reports that it predicts the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
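
The same flag works with srun. For example, to ask the scheduler when a 19-task job could start without actually running anything (the task count, time limit, and command here are purely illustrative):

srun --test-only --ntasks=19 --time=01:00:00 hostname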