
Checkpointing Overview

Checkpointing is when your job regularly saves its state so that, if it ends prematurely, it can be restarted and resume from the checkpoint rather than from the beginning. This is especially useful in high-performance computing, where a job can be ended prematurely for several reasons:

  • Time limit: A job will be killed when it reaches the walltime it requested. Some long-running jobs cannot complete within the cluster walltime limits. A job that checkpoints can continue its work across multiple consecutive jobs.
  • Out of memory: A job that consumes all the memory it requested will be killed. If the job was checkpointed, you can resume it after requesting more memory.
  • Preemption: Some of our nodes are owned, and owner jobs will preempt other jobs running on those nodes.
  • Hardware or cluster failure: With a cluster the size of Palmetto, hardware failure (although uncommon) can happen.

Checkpointing methods

There are three main methods of checkpointing:

  • Software Built-in Checkpointing: Many HPC applications or frameworks will automatically checkpoint work. For example:

    • GROMACS
    • LAMMPS
    • OpenFOAM
    • PyTorch model training

    In some cases, you will have to consult the software's documentation to enable checkpointing and restarting. When available, built-in checkpointing is the best choice.

  • User Programmed Checkpointing: If you are writing your own scripts or code, you should consider supporting checkpointing. The basic checkpointing recipe is:

    1. On script launch, look for a saved state file.
    2. If the state file exists, restore the state and resume calculations; otherwise, start from the beginning.
    3. Periodically save the current calculation state to the state file.
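
The recipe above can be sketched in Python as follows. This is only a minimal illustration, not a recommended implementation: the file name `checkpoint.json`, the helper names `load_state`, `save_state`, and `run`, and the `stop_after` parameter (used here to simulate a job being killed) are all hypothetical, and the "calculation" is a simple running sum standing in for real work.

```python
import json
import os
import tempfile

STATE_FILE = "checkpoint.json"  # hypothetical checkpoint file name


def load_state(path):
    """Steps 1 and 2: restore saved state if it exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(state, path):
    """Step 3: write to a temporary file, then rename it into place.

    The rename means a job killed mid-write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(n_steps, path=STATE_FILE, save_every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing every `save_every` steps.

    `stop_after` simulates the job being killed after that many steps in
    this run; it returns None without saving, as a real kill would.
    """
    state = load_state(path)
    steps_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # simulated kill: work since last checkpoint is lost
        state["total"] += state["step"]  # placeholder for the real calculation
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % save_every == 0:
            save_state(state, path)
    save_state(state, path)  # record the finished state
    return state["total"]
```

If the first call is interrupted, calling `run` again with the same `path` picks up from the most recent checkpoint rather than from step 0. Only the work done since that checkpoint is repeated, so the save interval is a trade-off between lost work and time spent writing files.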
  • Transparent Checkpointing: If your application does not support checkpointing, you can try a transparent checkpointer, which attempts to freeze the running application and save its state so that it can be restored later. These methods tend to be more brittle and to show more unexpected behavior, but they may be the only option if an application without built-in checkpointing needs to run longer than the current walltime limit. Right now, only one option has been tested on Palmetto: