
Check Availability of Slurm Nodes

Slurm cluster is in a closed beta

The Slurm cluster is currently in a closed beta until summer 2024. If you would like to join the beta program, submit a request.

Using sinfo

The standard Slurm sinfo command can be used to check the current cluster status and node availability.

Partition Summary

To generate a row per partition with summary availability information, use the following command.

sinfo -s

Example output:

PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
work1        up   infinite        3/18/0/21 node[0400-0401,0403-0419,0421,1036]

The NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
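
If you only care about a single partition, you can restrict the summary with the -p flag. The partition name below is simply the one from the example above; substitute the partition you are interested in.

sinfo -s -p work1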

Per Node Configuration Summary

To generate a row for each node configuration (CPU count, memory, and GPU count) with summary availability information, use the following command.

sinfo -eO "CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"

Example output:

CPUS    MEMORY   GRES          NODES(A/I/O/T)  NODELIST
64      256831   gpu:a100:2    4/16/0/20       node[0400-0401,0403-0419,0421]
128     1031142  gpu:mi210:1   0/1/0/1         node1036

The CPUS column gives the total core count, the MEMORY column gives the amount of memory in megabytes, the GRES column lists any GPUs (generic resources), and the NODES(A/I/O/T) column lists the number of nodes that are Allocated (in use), Idle, Other, and Total.
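
If it is also helpful to see which partition each node configuration belongs to, you can add a Partition field to the format string. This is just a variation on the command above; the field widths are arbitrary.

sinfo -eO "Partition:12,CPUs:8,Memory:9,Gres:14,NodeAIOT:16,NodeList:50"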

Checking Nodes Partially In Use

Although a node may be listed as allocated, it may still have some resources free. Such a node is in a "mixed" state. We can check nodes in this state and see how many CPUs and GPUs are available with the following command.

sinfo -NO "CPUs:8,CPUsState:16,Memory:9,AllocMem:10,Gres:14,GresUsed:24,NodeList:50" -t mixed

Example output:

CPUS    CPUS(A/I/O/T)   MEMORY   ALLOCMEM  GRES          GRES_USED               NODELIST
64      20/44/0/64      256831   4096      gpu:a100:2    gpu:a100:0(IDX:N/A)     node0403
64      45/19/0/64      256831   4096      gpu:a100:2    gpu:a100:2(IDX:0-1)     node0404
64      61/3/0/64       256831   8192      gpu:a100:2    gpu:a100:1(IDX:0)       node0405

This output provides the number of cores by state in the format "allocated/idle/other/total" in the CPUS(A/I/O/T) column. It also shows the memory already in use (ALLOCMEM) and the GPUs in use (GRES_USED).

In this example, the first node has 44 cores, 252GB memory (256GB - 4GB), and 2 GPUs free. The second has 19 cores, 252GB memory, and no GPUs free. The third has 3 cores, 248GB memory, and 1 GPU free.

This command only lists nodes in a mixed state; it will not show nodes that are completely idle or fully allocated.
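
If you just want the idle core count for each mixed node, one possible approach is to drop the header with --noheader and split the CPUS(A/I/O/T) field with awk. This is only a sketch; adjust the fields and widths to suit your needs.

sinfo -NO "NodeList:20,CPUsState:16" -t mixed --noheader | awk '{split($2, c, "/"); print $1, c[2], "idle CPUs"}'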

Estimating Job Start Time

When a job is queued, you can see its estimated start time by passing the --start flag to squeue. For example:

[dndawso@slogin001 ~]$ squeue --start --job 3103
JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES NODELIST(REASON)
 3103     work1 interact  dndawso PD 2023-12-05T23:07:12    19     (null) (Resources)

Here I see that job ID 3103 is currently pending (state is PD) and the reason is that it is waiting for resources. The scheduler predicts that the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
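
To see estimated start times for all of your own pending jobs rather than a single job ID, you can combine --start with user and state filters. This is a sketch; it assumes $USER holds your cluster username.

squeue --start --user $USER --states PD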

You can also estimate how long it would take a job to start, without actually submitting it, by passing --test-only to srun or sbatch. For example:

[dndawso@slogin001 jobperf-batch-test]$ sbatch --test-only run2.sh
sbatch: Job 3104 to start at 2023-12-05T23:07:12 using 19 processors on nodes node[0401,0403-0419,0421] in partition work1

In this example, the job was not actually submitted and queued, but the scheduler reports that it predicts the job should start December 5th at 11:07:12pm. This is subject to change if higher priority jobs are submitted.
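
The same flag works with srun. For example, to ask the scheduler when a 19-task job could start without actually running anything (the task count, time limit, and command here are purely illustrative):

srun --test-only --ntasks=19 --time=01:00:00 hostname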