
Check Availability of Slurm Nodes

There are several ways to check which hardware in the cluster is available, busy, or offline.

Using the online hardware table

You can visit the Hardware Table page on the documentation website to get a live view of the cluster's hardware availability in your browser.

tip

The in-browser hardware table automatically updates once per minute if you are logged in!

Using sinfo

The standard Slurm command sinfo can be used to check current cluster status and node availability.

Partition Summary

To generate a row per partition with summary availability information, use the following command.

sinfo -s

Example output:

PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
work1     up     infinite    3/18/0/21      node[0400-0401,0403-0419,0421,1036]

The NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
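
For example, to restrict the summary to a single partition, add the -p flag (using the work1 partition from the output above):

sinfo -s -p work1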

Per Node Configuration Summary

To generate a row for each node configuration (CPU count, memory, and GPUs) with summary availability information, use the following command.

sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"

Example output:

CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
64      256831   gpu:a100:2    4/16/0/20       node[0400-0401,0403-0419,0421]
128     1031142  gpu:mi210:1   0/1/0/1         node1036

The CPUS column gives the total core count, the MEMORY column gives the memory in megabytes, the GRES column lists any GPUs, and the NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
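
If you would like this summary to refresh periodically in your terminal, one option is to wrap the command in the standard watch utility (a minimal sketch; the 60-second interval is just an example):

watch -n 60 sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"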

Checking Nodes Partially In Use

Although a node may be listed as allocated, it may still have some resources free. This puts the node in a "mixed" state. We can list nodes in this state and see how many CPUs and GPUs are still available with the following command.

sinfo -NO "CPUs:8,CPUsState:16,Memory:9,AllocMem:10,Gres:14,GresUsed:24,NodeList:50" -t mixed

Example output:

CPUS    CPUS(A/I/O/T)   MEMORY   ALLOCMEM  GRES          GRES_USED               NODELIST
64      20/44/0/64      256831   4096      gpu:a100:2    gpu:a100:0(IDX:N/A)     node0403
64      45/19/0/64      256831   4096      gpu:a100:2    gpu:a100:2(IDX:0-1)     node0404
64      61/3/0/64       256831   8192      gpu:a100:2    gpu:a100:1(IDX:0)       node0405

This output provides the number of cores by state in the format "allocated/idle/other/total" in the CPUS(A/I/O/T) column. It also shows the memory already in use (ALLOCMEM) and the GPUs in use (GRES_USED).

In this example, the first node has 44 cores, 252GB memory (256GB - 4GB), and 2 GPUs free. The second has 19 cores, 252GB memory, and no GPUs free. The third has 3 cores, 248GB memory, and 1 GPU free.

This command only lists nodes in a mixed state; it will not show nodes that are completely idle or fully allocated.
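
If you also want to see fully idle nodes alongside the partially used ones, the -t flag accepts a comma-separated list of states. For example:

sinfo -NO "CPUs:8,CPUsState:16,Memory:9,AllocMem:10,Gres:14,GresUsed:24,NodeList:50" -t mixed,idle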

Estimating Job Start Time

When a job is queued, you can see its estimated start time by passing the --start flag to squeue. For example:

[dndawso@slogin001 ~]$ squeue --start --job 3103
JOBID  PARTITION  NAME      USER     ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
3103   work1      interact  dndawso  PD  2023-12-05T23:07:12  19     (null)      (Resources)

Here I see that job ID 3103 is currently pending (state is PD) and the reason is that it is waiting for resources. The scheduler predicts that the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
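
To check the predicted start times for all of your own pending jobs at once, you can filter squeue by user and state. For example:

squeue --start -u $USER -t PD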

You can also estimate when a job would start without actually submitting it by passing --test-only to srun or sbatch. For example:

[dndawso@slogin001 jobperf-batch-test]$ sbatch --test-only run2.sh
sbatch: Job 3104 to start at 2023-12-05T23:07:12 using 19 processors on nodes node[0401,0403-0419,0421] in partition work1

In this example, the job was not actually submitted and queued, but the scheduler reported that the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
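
The same check works with srun. The example below is only a sketch; the partition, node count, core count, and time limit are placeholders you would replace with your own request:

srun --test-only -p work1 -N 1 -c 4 --time=01:00:00 hostname    # placeholder resource request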