
Check Availability of Slurm Nodes

There are several ways to check which hardware in the cluster is available, busy, or offline.

Using the online hardware table

You can visit the Hardware Table page on the documentation website to get a live view of the cluster's hardware availability in your browser.

tip

The in-browser hardware table automatically updates once per minute if you are logged in!

Using sinfo

The standard Slurm command sinfo can be used to check current cluster status and node availability.

Partition Summary

To generate a row per partition with summary availability information, use the following command.

sinfo -s

Example output:

PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
work1     up     infinite    3/18/0/21      node[0400-0401,0403-0419,0421,1036]

The NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
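
For example, to restrict the summary to a single partition, add the -p flag (using the work1 partition from the output above):

sinfo -s -p work1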

Per Node Configuration Summary

To generate a row for each node configuration (CPU count, memory, and GPUs) with summary availability information, use the following command.

sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"

Example output:

CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
64      256831   gpu:a100:2    4/16/0/20       node[0400-0401,0403-0419,0421]
128     1031142  gpu:mi210:1   0/1/0/1         node1036

The CPUS column gives the total core count, the MEMORY column gives the memory in megabytes, the GRES column lists any GPUs, and the NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
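
If you would like this summary to refresh periodically in your terminal, one option is to wrap the command in the standard watch utility (a minimal sketch; the 60-second interval is just an example):

watch -n 60 sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"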

Checking Nodes Partially In Use

Although a node may be listed as allocated, it may still have some resources free. This puts the node in a "mixed" state. We can list nodes in this state and see how many CPUs and GPUs are still available with the following command.

sinfo -NO "CPUs:8,CPUsState:16,Memory:9,AllocMem:10,Gres:14,GresUsed:24,NodeList:50" -t mixed

Example output:

CPUS    CPUS(A/I/O/T)   MEMORY   ALLOCMEM  GRES          GRES_USED               NODELIST
64      20/44/0/64      256831   4096      gpu:a100:2    gpu:a100:0(IDX:N/A)     node0403
64      45/19/0/64      256831   4096      gpu:a100:2    gpu:a100:2(IDX:0-1)     node0404
64      61/3/0/64       256831   8192      gpu:a100:2    gpu:a100:1(IDX:0)       node0405

This output provides the number of cores by state in the format "allocated/idle/other/total" in the CPUS(A/I/O/T) column. It also shows the memory already in use (ALLOCMEM) and the GPUs in use (GRES_USED).

In this example, the first node has 44 cores, 252GB memory (256GB - 4GB), and 2 GPUs free. The second has 19 cores, 252GB memory, and no GPUs free. The third has 3 cores, 248GB memory, and 1 GPU free.

This command only lists nodes in a mixed state; it will not show nodes that are completely idle or fully allocated.
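
If you also want to see fully idle nodes alongside the partially used ones, the -t flag accepts a comma-separated list of states. For example:

sinfo -NO "CPUs:8,CPUsState:16,Memory:9,AllocMem:10,Gres:14,GresUsed:24,NodeList:50" -t mixed,idle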

Estimating Job Start Time

When a job is queued, you can see its estimated start time by passing the --start flag to squeue. For example:

[dndawso@slogin001 ~]$ squeue --start --job 3103
JOBID  PARTITION  NAME      USER     ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
3103   work1      interact  dndawso  PD  2023-12-05T23:07:12  19     (null)      (Resources)

Here I see that job ID 3103 is currently pending (state is PD) and the reason is that it is waiting for resources. The scheduler predicts that the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
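
To check the predicted start times for all of your own pending jobs at once, you can filter squeue by user and state. For example:

squeue --start -u $USER -t PD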

You can also estimate when a job would start without actually submitting it by passing --test-only to srun or sbatch. For example:

[dndawso@slogin001 jobperf-batch-test]$ sbatch --test-only run2.sh
sbatch: Job 3104 to start at 2023-12-05T23:07:12 using 19 processors on nodes node[0401,0403-0419,0421] in partition work1

In this example, the job was not actually submitted and queued, but the scheduler reported that the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
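
The same check works with srun. The example below is only a sketch; the partition, node count, core count, and time limit are placeholders you would replace with your own request:

srun --test-only -p work1 -N 1 -c 4 --time=01:00:00 hostname    # placeholder resource request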