Checkpointing with DMTCP

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent checkpointing tool: it can checkpoint and later restore running software, even software that was not designed with checkpointing in mind.

WIP

These checkpointing instructions are a work in progress, and these methods are still being tested. Expect changes, and be sure to test checkpointing and restoring on shorter jobs before relying on them for longer jobs.

Known Limitations

DMTCP will not work for all jobs. The scripts and information below should work, but are currently untested, for:

  • MPI jobs
  • Multi-node jobs

Install

Currently, this software is not installed cluster-wide. Instead, install it yourself through Spack:

module load spack
spack install dmtcp
spack load dmtcp
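
If you want to confirm that the install worked (optional; the exact version reported will depend on what Spack installed), check that the DMTCP tools are now on your PATH:

which dmtcp_launch dmtcp_command dmtcp_restart
dmtcp_launch --version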

Usage

Overview

DMTCP uses a coordinator process to manage the checkpointing and restoring. Each job task must then be launched with the dmtcp_launch wrapper. This wrapper communicates with the coordinator whenever a checkpoint is requested.
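
As a rough sketch of that flow (the step-by-step instructions follow below), running a program under DMTCP looks like this. You normally do not need to start the coordinator yourself: dmtcp_launch starts one automatically if none is running, which is what the examples below rely on. Here myprogram is a placeholder for your own executable:

dmtcp_launch ./myprogram       # run the workload under DMTCP (starts a coordinator if needed)
dmtcp_command --bcheckpoint    # from another shell: request a blocking checkpoint
dmtcp_restart *.dmtcp          # later: restore the process from its checkpoint files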

Interactive

Manual Checkpoints

The easiest way to use DMTCP interactively within a job is to have two terminals open to the compute node. Start an interactive job with:

terminal 1
salloc

This will be "terminal 1". Then connect to that job from another terminal using:

terminal 2
srun --overlap --jobid <jobid> --pty bash --login

Be sure to replace <jobid> with the job ID of the job started with salloc. This will be "terminal 2".

Load dmtcp in both terminals:

terminal 1 and 2
module load spack
spack load dmtcp

In terminal 1, set up a checkpoint directory by setting the DMTCP_CHECKPOINT_DIR environment variable. This should be an empty directory (create a new one if needed using mkdir). Also load Anaconda so we can test with an interactive Python session.

terminal 1
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/interactive-checkpoint-test
mkdir $DMTCP_CHECKPOINT_DIR # Create the directory if needed.

module load anaconda3/2023.09-0

We can then launch our work using dmtcp_launch. Let's start Python interactively:

terminal 1
dmtcp_launch python3

Let's set up some variables so we can check the saved state later. Run the following in the Python session (terminal 1).

terminal 1
test1 = "important data"
test2 = 12345
print(test1)
print(test2)

Terminal 1 session

bash-4.4$ salloc
salloc: Verifying interactive job limits...
salloc: Setting partition to interact...
salloc: Queue: interact
salloc: Submit checks complete!
salloc: Pending job allocation 193631
salloc: job 193631 queued and waiting for resources
salloc: job 193631 has been allocated resources
salloc: Granted job allocation 193631
salloc: Waiting for resource configuration
salloc: Nodes node0495 are ready for job
bash-4.4$ module load spack
bash-4.4$ spack load dmtcp
bash-4.4$ export DMTCP_CHECKPOINT_DIR=/scratch/$USER/interactive-checkpoint-test
bash-4.4$ mkdir $DMTCP_CHECKPOINT_DIR
bash-4.4$ dmtcp_launch python3
Python 3.6.8 (default, Oct 23 2023, 19:59:56)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> test1 = "important data"
>>> test2 = 12345
>>> print(test1)
important data
>>> print(test2)
12345
>>>

We can then test checkpointing by running the following command in terminal 2:

terminal 2
dmtcp_command --bcheckpoint

This command should have created some checkpoint files. You should see them in the checkpoint directory:

terminal 2
ls -l /scratch/$USER/interactive-checkpoint-test

Let's test restoring from the checkpoint. Exit the current Python interpreter by running:

terminal 1
exit()

Then try restoring from the checkpoint files using:

terminal 1
dmtcp_restart $DMTCP_CHECKPOINT_DIR/*.dmtcp

You may see some warnings as DMTCP restarts the process; these can be ignored. If everything restored properly, your Python session should be up and running again, and you should be able to print the variables and see that they were restored:

example terminal 1 session
bash-4.4$ dmtcp_restart $DMTCP_CHECKPOINT_DIR/*.dmtcp
[270393] mtcp_restart.c:347 restore_brk:
error: new/current break (0x11410000) != saved break (0x55f6f5d8d000)
[2024-06-23T22:33:14.041, 40000, 40002, NOTE] at processinfo.cpp:436 in restoreHeap; REASON='Failed to restore area between saved_break and curr_break.'
_savedBrk = 94519174942720
curBrk = 289472512
(strerror((*__errno_location ()))) = Cannot allocate memory
>>> print(test1)
important data
>>> print(test2)
12345
>>>

Automatic Checkpoints at Interval

You can also have DMTCP checkpoint automatically at a regular interval. To do this, pass --interval <number-seconds> to dmtcp_launch and dmtcp_restart.

For example, if we want to automatically checkpoint our python shell every 10 seconds, we could run:

terminal 1
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/interactive-checkpoint-test2
mkdir $DMTCP_CHECKPOINT_DIR # Create the directory if needed.

dmtcp_launch --interval 10 python3

Then you could exit (run exit()) and restart it whenever you wanted, and it should pick up at the most recent checkpoint. In the following example, it checkpointed right after I typed the e in exit(), so when it was restored and I pressed enter, Python raised a NameError: name 'e' is not defined. All the variables were still restored:

example terminal 1 session
bash-4.4$ mkdir $DMTCP_CHECKPOINT_DIR        # Create the directory if needed.
bash-4.4$ dmtcp_launch --interval 10 python3
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> var1 = "important data set loaded"
>>> var2 = "another important dataset"
>>> var3 = "important results"
>>> exit()
bash-4.4$ dmtcp_restart --interval 10 $DMTCP_CHECKPOINT_DIR/*.dmtcp
[2024-06-23T22:49:08.385, 40000, 40002, NOTE] at processinfo.cpp:426 in restoreHeap; REASON='Area between saved_break and curr_break not mapped, mapping it now'
_savedBrk = 14934016
curBrk = 289472512

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'e' is not defined
>>> print (var1)
important data set loaded
>>> print (var2)
another important dataset
>>> print (var3)
important results
>>>
>>>
>>> exit()

Batch

To demonstrate the checkpointing below, we'll assume there is an R script, compute_pi.R, available in your working directory. This script takes about 8 minutes or more to run.

compute_pi.R

Create a file compute_pi.R in your working directory:

compute_pi.R
# Number of random points to generate
num_points <- 500000000

# Estimate Pi using Monte Carlo simulation
inside_circle <- 0
for (i in 1:num_points) {
  x <- runif(1, 0, 1)  # Generate random x-coordinate
  y <- runif(1, 0, 1)  # Generate random y-coordinate

  # Check if the point is inside the quarter-circle
  if (x^2 + y^2 <= 1) {
    inside_circle <- inside_circle + 1
  }
}

# Estimate Pi
pi_estimate <- 4 * inside_circle / num_points
cat("Estimated Pi:", pi_estimate, "\n")
cat("Used ", num_points, " points.\n")

The simplest approach to checkpointing and restoring for a batch job is to have two separate scripts: one for initial launching, and the other for restarting.

The initial script could look something like:

compute_pi_launch.sh
#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:04:00
#SBATCH --job-name=compute_pi_test

module load spack
spack load dmtcp
module load r/4.4.0

export DMTCP_CHECKPOINT_DIR=/scratch/$USER/compute-pi-checkpoint
mkdir "$DMTCP_CHECKPOINT_DIR"
dmtcp_launch --interval 60 Rscript compute_pi.R

This script will:

  • Load dmtcp (and spack).
  • Load r/4.4.0 (needed to run the R script).
  • Set up a checkpoint directory (note: this directory must be empty for the first run).
  • Use DMTCP to launch the R script and checkpoint every 60 seconds.

The restart script could look something like:

compute_pi_restart.sh
#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:04:00
#SBATCH --job-name=compute_pi_test

module load spack
spack load dmtcp

export DMTCP_CHECKPOINT_DIR=/scratch/$USER/compute-pi-checkpoint
dmtcp_restart --interval 60 $DMTCP_CHECKPOINT_DIR/*.dmtcp

This script will:

  • Load dmtcp (and spack).
  • Configure the checkpoint directory.
  • Use DMTCP to restart from the checkpoint files, and then continue to checkpoint every 60 seconds.
  • Not load r/4.4.0. The checkpoint files already include the initial environment (all modules that were initially loaded).

To run a job, then, you would first submit the launch script:

sbatch compute_pi_launch.sh

Then you'd have to check back when the job ended. If the job didn't complete, you'd submit the restart script to have it continue:

sbatch compute_pi_restart.sh

This process is pretty straightforward, but it is a bit cumbersome. If we are willing to increase the complexity of our submit script, we can write a single batch script that:

  • Automatically detects whether this is the first time the job is launched, or whether there are already checkpoints to restore from.
  • Instructs Slurm to signal the job when it is close to running out of time. When the script receives this signal, it can:
    1. Trigger a checkpoint.
    2. Re-queue itself as a new job.
    3. Exit, killing the workload.
  • Still checkpoints periodically (the end-of-job signal will not be sent when a job is preempted or when there is a hardware failure, etc.).

Because it checkpoints and re-submits itself, it should completely finish the computation (across multiple jobs) without any interaction needed. If a job is interrupted (preempted, hardware failure, etc.), it will still have checkpointed periodically, and if you manually re-submit the job it will pick up where it left off.

compute_pi.sh
#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:04:00 # Run for 4 minutes
#SBATCH --signal=B:USR1@30 # Send USR1 signal 30 seconds before the job ends.
#SBATCH --job-name=compute_pi_test
#SBATCH --output=compute_pi_out.txt
#SBATCH --open-mode=append

# This simple log function will annotate output with the date and job ID.
log () {
    echo "$(date) jobid=$SLURM_JOB_ID " "$@"
}
log "Job starting"

module load spack
spack load dmtcp
module load r/4.4.0

# This bash function will handle the USR1 signal that is sent before the job
# finishes, checkpointing the job and queuing another job to resume the work.
handle_job_end () {
    log "Job is ending. Checkpointing..."
    dmtcp_command --bcheckpoint
    log "Checkpoint complete."
    log "Spawning new job..."
    sbatch compute_pi.sh
    log "Exiting..."
    exit 0
}
# The trap command registers handle_job_end as the handler of the USR1 signal.
trap handle_job_end SIGUSR1

# This is the folder where checkpoints are stored. You should have a separate
# checkpoint folder for each work item (e.g. if you are creating multiple jobs
# for a parameter sweep, each should have a separate checkpoint directory).
#
# This script assumes that the directory does not exist for the first run (with
# no checkpoints) and that if it does exist, the script will try to restart from
# a previous checkpoint.
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/compute-pi-checkpoint

# Check if the checkpoint directory exists.
if [ -d "$DMTCP_CHECKPOINT_DIR" ]; then
    log "Checkpoint directory exists."
    log "Restoring from checkpoint..."

    dmtcp_restart --interval 60 $DMTCP_CHECKPOINT_DIR/*.dmtcp &
    wait

else
    log "No checkpoint directory, creating new."
    mkdir "$DMTCP_CHECKPOINT_DIR"

    # Note that the workload is launched in the background (using the &) and we
    # explicitly call wait. Bash will not handle signals while a program is
    # running in the foreground, so to receive the USR1 signal and handle the
    # checkpoint, we need our workload running in the background.
    log "Computing pi..."
    dmtcp_launch --interval 60 Rscript compute_pi.R &
    wait
fi

# This line only runs if the job ran entirely to completion.
log "Job ran to completion!"

Example Output

Sun Jun 23 15:31:04 EDT 2024 jobid=191611  Job starting
Sun Jun 23 15:31:05 EDT 2024 jobid=191611 No checkpoint directory, creating new.
Sun Jun 23 15:31:05 EDT 2024 jobid=191611 Computing pi...
Sun Jun 23 15:34:19 EDT 2024 jobid=191611 Job is ending. Checkpointing...
Sun Jun 23 15:34:20 EDT 2024 jobid=191611 Checkpoint complete.
Sun Jun 23 15:34:20 EDT 2024 jobid=191611 Spawning new job...
sbatch: Queue: work1
sbatch: Submit checks complete!
Submitted batch job 191615
Sun Jun 23 15:34:20 EDT 2024 jobid=191611 Exiting...
Sun Jun 23 15:34:21 EDT 2024 jobid=191615 Job starting
Sun Jun 23 15:34:22 EDT 2024 jobid=191615 Checkpoint directory exists.
Sun Jun 23 15:34:22 EDT 2024 jobid=191615 Restoring from checkpoint...
[2024-06-23T15:34:25.276, 40000, 40004, NOTE] at processinfo.cpp:426 in restoreHeap; REASON='Area between saved_break and curr_break not mapped, mapping it now'
_savedBrk = 86708224
curBrk = 289472512
Sun Jun 23 15:37:50 EDT 2024 jobid=191615 Job is ending. Checkpointing...
Sun Jun 23 15:37:54 EDT 2024 jobid=191615 Checkpoint complete.
Sun Jun 23 15:37:54 EDT 2024 jobid=191615 Spawning new job...
sbatch: Queue: work1
sbatch: Submit checks complete!
Submitted batch job 191617
Sun Jun 23 15:37:55 EDT 2024 jobid=191615 Exiting...
Sun Jun 23 15:37:56 EDT 2024 jobid=191617 Job starting
Sun Jun 23 15:37:57 EDT 2024 jobid=191617 Checkpoint directory exists.
Sun Jun 23 15:37:57 EDT 2024 jobid=191617 Restoring from checkpoint...
Estimated Pi: 3.141537
Used 5e+08 points.
Sun Jun 23 15:40:19 EDT 2024 jobid=191617 Job ran to completion!

If you would like to tailor this for your own jobs, here are the changes you should consider:

  • Update the resource requests (CPU cores, memory).
  • Update the maximum walltime as needed.
  • Increase how long before the job ends that Slurm sends the USR1 signal (see the example after this list). The compute_pi job is pretty lightweight; for jobs that use more memory, checkpointing can take longer. If your job runs for 72 hours, you may want to send the signal 30 minutes, or even an hour, before the job ends.
  • Update the script name and job name.
  • Update which modules are loaded or environments are activated (replace module load r/4.4.0).
  • Update the execution call (replace the Rscript compute_pi.R line).
  • Update the checkpoint directory.
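
As an illustration only, here is how the top of the batch script might change for a hypothetical 72-hour, 8-core Python job. The module, file, and job names below are placeholders, not recommendations, and the rest of the script (the log and handle_job_end functions and the checkpoint/restart logic) would stay the same:

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # more CPU cores
#SBATCH --mem=32gb               # more memory
#SBATCH --time=72:00:00          # longer maximum walltime
#SBATCH --signal=B:USR1@1800     # send USR1 30 minutes before the job ends
#SBATCH --job-name=my_analysis   # hypothetical job name
#SBATCH --output=my_analysis_out.txt
#SBATCH --open-mode=append

module load spack
spack load dmtcp
module load anaconda3/2023.09-0  # e.g. a Python environment instead of R

# ...later, in the else branch, launch your own workload instead of the R script:
dmtcp_launch --interval 60 python3 my_analysis.py &
wait

Remember that handle_job_end re-submits the script by name (sbatch compute_pi.sh), so that line needs to match the new script name as well.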