Checkpointing with DMTCP
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent checkpointing tool: it can checkpoint and restore software even if that software was not designed with checkpointing in mind.
These checkpointing instructions are a work in progress, and these methods are still being tested. Expect changes, and be sure to test checkpointing and restoring on shorter jobs before relying on them for longer jobs.
Known Limitations
DMTCP will not work for all jobs. It is not expected to work with:
- Jobs using GPUs
- Jobs using Apptainer, unless Apptainer's built-in checkpointing support is enabled.
It should work, but the scripts and information below are currently untested for:
- MPI jobs
- Multi-node jobs
Install
Currently, this software is not installed cluster-wide. Instead, install through spack:
module load spack
spack install dmtcp
spack load dmtcp
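Once installed, you can quickly confirm the DMTCP tools are on your PATH before moving on:
which dmtcp_launch dmtcp_command dmtcp_restart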
Usage
Overview
DMTCP uses a coordinator process to manage the checkpointing and restoring.
Each job task must then be launched with the dmtcp_launch wrapper. This wrapper communicates with the coordinator whenever a checkpoint is requested.
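In outline, a DMTCP session looks something like the sketch below, where ./myprogram is just a placeholder for your own executable. dmtcp_launch will start a coordinator in the background automatically if one is not already running:
dmtcp_launch ./myprogram        # run the workload under DMTCP control
dmtcp_command --bcheckpoint     # from another shell: request a (blocking) checkpoint
dmtcp_restart *.dmtcp           # later: resume from the checkpoint files that were written
By default the checkpoint files are written to the current directory; the examples below set DMTCP_CHECKPOINT_DIR to use a dedicated directory instead.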
Interactive
Manual Checkpoints
The easiest way to use DMTCP interactively within a job is to have two terminals open to the compute node. Start an interactive job with:
salloc
This will be "terminal 1". Then connect to that job from another terminal using:
srun --overlap --jobid <jobid> --pty bash --login
Be sure to replace <jobid> with the job ID of the job started with salloc.
This will be "terminal 2".
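If you don't have the job ID handy, you can list your running jobs (and their IDs) with:
squeue -u $USER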
Load dmtcp in both terminals:
module load spack
spack load dmtcp
In terminal 1, set up a checkpoint directory by setting the DMTCP_CHECKPOINT_DIR environment variable. This should be an empty directory (create a new one if needed using mkdir). Also load Anaconda so we can test with an interactive Python session.
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/interactive-checkpoint-test
mkdir $DMTCP_CHECKPOINT_DIR # Create the directory if needed.
module load anaconda3/2023.09-0
We can then launch our work using dmtcp_launch. Let's start Python interactively:
dmtcp_launch python3
Let's set up some variables so we can check the saved state later. Run the following in the Python session (terminal 1).
test1 = "important data"
test2 = 12345
print(test1)
print(test2)
Terminal 1 session
bash-4.4$ salloc
salloc: Verifying interactive job limits...
salloc: Setting partition to interact...
salloc: Queue: interact
salloc: Submit checks complete!
salloc: Pending job allocation 193631
salloc: job 193631 queued and waiting for resources
salloc: job 193631 has been allocated resources
salloc: Granted job allocation 193631
salloc: Waiting for resource configuration
salloc: Nodes node0495 are ready for job
bash-4.4$ module load spack
bash-4.4$ spack load dmtcp
bash-4.4$ export DMTCP_CHECKPOINT_DIR=/scratch/$USER/interactive-checkpoint-test
bash-4.4$ mkdir $DMTCP_CHECKPOINT_DIR
bash-4.4$ dmtcp_launch python3
Python 3.6.8 (default, Oct 23 2023, 19:59:56)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> test1 = "important data"
>>> test2 = 12345
>>> print(test1)
important data
>>> print(test2)
12345
>>>
We can then test checkpointing by running the following command in terminal 2:
dmtcp_command --bcheckpoint
This command should have created some checkpoint files. You should see them in the checkpoint directory:
ls -l /scratch/$USER/interactive-checkpoint-test
Let's test restoring from the checkpoint. Exit the current Python interpreter by running:
exit()
Then try loading the checkpoints using:
dmtcp_restart $DMTCP_CHECKPOINT_DIR/*.dmtcp
You may see some warnings that you can ignore as DMTCP restarts the process. If everything restored properly, your Python session should be up and running again. You should be able to print the variables and see that they were restored:
bash-4.4$ dmtcp_restart $DMTCP_CHECKPOINT_DIR/*.dmtcp
[270393] mtcp_restart.c:347 restore_brk:
error: new/current break (0x11410000) != saved break (0x55f6f5d8d000)
[2024-06-23T22:33:14.041, 40000, 40002, NOTE] at processinfo.cpp:436 in restoreHeap; REASON='Failed to restore area between saved_break and curr_break.'
_savedBrk = 94519174942720
curBrk = 289472512
(strerror((*__errno_location ()))) = Cannot allocate memory
>>> print(test1)
important data
>>> print(test2)
12345
>>>
Automatic Checkpoints at Interval
You can also have DMTCP checkpoint automatically at a regular interval. To do this, pass --interval <number-seconds> to dmtcp_launch and dmtcp_restart.
For example, if we want to automatically checkpoint our python shell every 10 seconds, we could run:
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/interactive-checkpoint-test2
mkdir $DMTCP_CHECKPOINT_DIR # Create the directory if needed.
dmtcp_launch --interval 10 python3
Then you could exit (run exit()) and restart it whenever you wanted, and it should pick up at the most recent checkpoint. In the following example, it checkpointed right after I typed the e in exit(), so when it was restored and I pressed enter, it raised a NameError: name 'e' is not defined. All the variables were still restored:
bash-4.4$ mkdir $DMTCP_CHECKPOINT_DIR # Create the directory if needed.
bash-4.4$ dmtcp_launch --interval 10 python3
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> var1 = "important data set loaded"
>>> var2 = "another important dataset"
>>> var3 = "important results"
>>> exit()
bash-4.4$ dmtcp_restart --interval 10 $DMTCP_CHECKPOINT_DIR/*.dmtcp
[2024-06-23T22:49:08.385, 40000, 40002, NOTE] at processinfo.cpp:426 in restoreHeap; REASON='Area between saved_break and curr_break not mapped, mapping it now'
_savedBrk = 14934016
curBrk = 289472512
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'e' is not defined
>>> print (var1)
important data set loaded
>>> print (var2)
another important dataset
>>> print (var3)
important results
>>>
>>>
>>> exit()
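When you are done experimenting, you may want to shut down the coordinator (if one is still running) and clear out the old checkpoints, since a fresh dmtcp_launch expects an empty checkpoint directory. For example:
dmtcp_command --quit           # ask the coordinator, if still running, to shut down
rm -rf $DMTCP_CHECKPOINT_DIR   # remove old checkpoint files before a fresh launch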
Batch
To demonstrate the checkpointing below, we'll assume there is an R script, compute_pi.R, available in your working directory. This script takes about 8 minutes or more to execute.
compute_pi.R
Create a file compute_pi.R in your working directory (the directory you will submit the batch jobs from):
# Number of random points to generate
num_points <- 500000000
# Estimate Pi using Monte Carlo simulation
inside_circle <- 0
for (i in 1:num_points) {
  x <- runif(1, 0, 1)  # Generate random x-coordinate
  y <- runif(1, 0, 1)  # Generate random y-coordinate
  # Check if the point is inside the quarter-circle
  if (x^2 + y^2 <= 1) {
    inside_circle <- inside_circle + 1
  }
}
# Estimate Pi
pi_estimate <- 4 * inside_circle / num_points
cat("Estimated Pi:", pi_estimate, "\n")
cat("Used ", num_points, " points.\n")
The simplest approach to checkpointing and restoring for a batch job is to have two separate scripts: one for initial launching, and the other for restarting.
The initial script could look something like:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:04:00
#SBATCH --job-name=compute_pi_test
module load spack
spack load dmtcp
module load r/4.4.0
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/compute-pi-checkpoint
mkdir "$DMTCP_CHECKPOINT_DIR"
dmtcp_launch --interval 60 Rscript compute_pi.R
This script will:
- Load dmtcp (and spack).
- Load r/4.4.0 (needed to run the R script).
- Set up a checkpoint directory (note: this directory must be empty for the first run).
- Use DMTCP to launch the R script and checkpoint every 60 seconds.
The restart script could look something like:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:04:00
#SBATCH --job-name=compute_pi_test
module load spack
spack load dmtcp
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/compute-pi-checkpoint
dmtcp_restart --interval 60 $DMTCP_CHECKPOINT_DIR/*.dmtcp
This script will:
- Load dmtcp (and spack).
- Configure the checkpoint directory.
- Use DMTCP to restart from the checkpoint files, and then continue to checkpoint every 60 seconds.
- Not load r/4.4.0. The checkpoint files already include the initial environment (all modules that were initially loaded).
To run a job, then, you would first submit the launch script:
sbatch compute_pi_launch.sh
Then you'd have to check back when the job ended. If the job didn't complete, you'd submit the restart script to have it continue:
sbatch compute_pi_restart.sh
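Before resubmitting, you can confirm that the earlier job ran out of time rather than finishing by checking its state with sacct (replace <jobid> with the ID of that job):
sacct -j <jobid> --format=JobID,JobName,State,Elapsed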
This process is pretty straightforward, but a bit cumbersome. If we are willing to increase the complexity of our submit script, we can write a single batch script that:
- Automatically detects whether this is the first time the job is launched, or if there are already checkpoints to restore from.
- Instructs Slurm to signal the job when it is close to running out of time. When it receives this signal, it can:
  - Trigger a checkpoint.
  - Re-queue itself as a new job.
  - Exit, killing the workload.
- Still periodically checkpoints (the job-ending signal will not be sent when a job is preempted, if there is a hardware failure, etc.).
Because it checkpoints and re-submits itself, it should completely finish the computation (across multiple jobs) without any interaction needed. If the job is interrupted (preempted, hardware failure), it will still have checkpointed periodically, and if you manually re-submit the job it will pick up where it left off.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:04:00 # Run for 4 minutes
#SBATCH --signal=B:USR1@30 # Send USR1 signal 30 seconds before the job end.
#SBATCH --job-name=compute_pi_test
#SBATCH --output=compute_pi_out.txt
#SBATCH --open-mode=append
# This simple log function will annotate output with date and job ID.
log () {
    echo "$(date) jobid=$SLURM_JOB_ID " "$@"
}
log "Job starting"
module load spack
spack load dmtcp
module load r/4.4.0
# This bash function will handle the USR1 signal that happens before the job
# finishes, checkpointing the job and queuing another job to resume the work.
handle_job_end () {
    log "Job is ending. Checkpointing..."
    dmtcp_command --bcheckpoint
    log "Checkpoint complete."
    log "Spawning new job..."
    sbatch compute_pi.sh
    log "Exiting..."
    exit 0
}
# The trap command registers the handle_job_end as the handler of the USR1 signal.
trap handle_job_end SIGUSR1
# This is the folder where checkpoints are stored. You should have a separate
# checkpoint folder for each work item (e.g. if you are creating multiple jobs
# for a parameter sweep, each should have a separate checkpoint directory).
#
# This script assumes that the directory does not exist for the first run (with
# no checkpoints) and that if it exists, it will try to restart a previous
# checkpoint.
export DMTCP_CHECKPOINT_DIR=/scratch/$USER/compute-pi-checkpoint
# Check if the checkpoint directory exists.
if [ -d "$DMTCP_CHECKPOINT_DIR" ]; then
    log "Checkpoint directory exists."
    log "Restoring from checkpoint..."
    dmtcp_restart --interval 60 $DMTCP_CHECKPOINT_DIR/*.dmtcp &
    wait
else
    log "No checkpoint directory, creating new."
    mkdir "$DMTCP_CHECKPOINT_DIR"
    # Note that the job is launched in the background (using the &) and we
    # explicitly call wait. Bash will not handle signals while a program
    # is in the foreground, so to receive the USR1 signal and handle the
    # checkpoint, we need our workload running in the background.
    log "Computing pi..."
    dmtcp_launch --interval 60 Rscript compute_pi.R &
    wait
fi
# This line only runs if the job ran entirely to completion.
log "Job ran to completion!"
Example Output
Sun Jun 23 15:31:04 EDT 2024 jobid=191611 Job starting
Sun Jun 23 15:31:05 EDT 2024 jobid=191611 No checkpoint directory, creating new.
Sun Jun 23 15:31:05 EDT 2024 jobid=191611 Computing pi...
Sun Jun 23 15:34:19 EDT 2024 jobid=191611 Job is ending. Checkpointing...
Sun Jun 23 15:34:20 EDT 2024 jobid=191611 Checkpoint complete.
Sun Jun 23 15:34:20 EDT 2024 jobid=191611 Spawning new job...
sbatch: Queue: work1
sbatch: Submit checks complete!
Submitted batch job 191615
Sun Jun 23 15:34:20 EDT 2024 jobid=191611 Exiting...
Sun Jun 23 15:34:21 EDT 2024 jobid=191615 Job starting
Sun Jun 23 15:34:22 EDT 2024 jobid=191615 Checkpoint directory exists.
Sun Jun 23 15:34:22 EDT 2024 jobid=191615 Restoring from checkpoint...
[2024-06-23T15:34:25.276, 40000, 40004, NOTE] at processinfo.cpp:426 in restoreHeap; REASON='Area between saved_break and curr_break not mapped, mapping it now'
_savedBrk = 86708224
curBrk = 289472512
Sun Jun 23 15:37:50 EDT 2024 jobid=191615 Job is ending. Checkpointing...
Sun Jun 23 15:37:54 EDT 2024 jobid=191615 Checkpoint complete.
Sun Jun 23 15:37:54 EDT 2024 jobid=191615 Spawning new job...
sbatch: Queue: work1
sbatch: Submit checks complete!
Submitted batch job 191617
Sun Jun 23 15:37:55 EDT 2024 jobid=191615 Exiting...
Sun Jun 23 15:37:56 EDT 2024 jobid=191617 Job starting
Sun Jun 23 15:37:57 EDT 2024 jobid=191617 Checkpoint directory exists.
Sun Jun 23 15:37:57 EDT 2024 jobid=191617 Restoring from checkpoint...
Estimated Pi: 3.141537
Used 5e+08 points.
Sun Jun 23 15:40:19 EDT 2024 jobid=191617 Job ran to completion!
If you would like to tailor this for your own jobs, here are the changes you should consider:
- Update resource requests (CPU cores, memory).
- Update the max walltime as needed.
- Increase the amount of time before the job ends that Slurm sends the USR1 signal (see the sketch after this list). The compute_pi job is pretty lightweight; for jobs that use more memory, checkpointing can take longer. If your job is 72 hours, you may want to send the signal 30 minutes, or even an hour, before the job completes.
- Update the script and job name.
- Update loading any modules or activating any environments (replace module load r/4.4.0).
- Update the execution call.
- Update the checkpoint directory.
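For example, for a hypothetical 72-hour job that needs more memory and a longer checkpointing lead time, the top of the script and the launch line might change along these lines (all values and names here are placeholders, not recommendations):
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # placeholder CPU request
#SBATCH --mem=32gb                 # placeholder memory request
#SBATCH --time=72:00:00            # run for up to 72 hours
#SBATCH --signal=B:USR1@1800       # send USR1 30 minutes before the job ends
#SBATCH --job-name=my_analysis
#SBATCH --output=my_analysis_out.txt
#SBATCH --open-mode=append
# ...and later in the script, replace the workload launch line, for example:
dmtcp_launch --interval 600 ./my_analysis    # hypothetical workload, checkpointed every 10 minutes
Remember to also change the sbatch call in handle_job_end so it re-queues the renamed script.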