These are the notes for the Checkpoint Techniques workshop I attended on March 26th, 2015 (the workshop materials can be found at [[http://...]]). They might be useful for people who want to learn how to code this into their own programs. Please don't hesitate to edit this page if you feel I left something out, if you want to add something of your own, or if my English sounds off.

===== Random stuff =====
  * Maximum ''...''
  * There is a [[https://...]]
===== The problem =====
==== Manual checkpoints ====
Manual checkpoints require you to modify your code to manually save your system state every N iterations (for example) and then, upon start, check if there is some previously saved state that can be used as the initial seed. In some cases this might not be possible, depending on your particular job (for instance, it might be a very long optimization process that depends on some internal hysteresis data that is not offered as output). The particular file format used for this task is up to you. In my case, I save intermediate parameter values in ''...'' files.
If you are using parallel programs, you might be interested in HDF5, which allows you to read/write data in parallel environments. Check [[http://...]].
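
To make the idea concrete, here is a minimal sketch of the manual approach (this is not code from the workshop), assuming a hypothetical iterative job whose whole state fits in a dictionary and a checkpoint file called ''state.pkl'':

<code python>
import os
import pickle

CHECKPOINT = "state.pkl"   # hypothetical checkpoint file name
SAVE_EVERY = 100           # save the state every N iterations

# On start, check whether a previously saved state exists and use it
# as the initial seed; otherwise start from scratch.
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
else:
    state = {"iteration": 0, "params": [0.0, 0.0]}

while state["iteration"] < 10000:
    # ... one step of the real computation would go here ...
    state["iteration"] += 1

    # Every SAVE_EVERY iterations, dump the whole state to disk.
    if state["iteration"] % SAVE_EVERY == 0:
        with open(CHECKPOINT + ".tmp", "wb") as f:
            pickle.dump(state, f)
        os.rename(CHECKPOINT + ".tmp", CHECKPOINT)
</code>

Writing to a temporary file and renaming it afterwards avoids ending up with a half-written checkpoint if the job gets killed in the middle of a dump.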
<file bash>
# 7779 by default, but if there are several DMTCP schedulers running on
# the same node we will have problems. The best solution is to assign the
# port number manually. Also, if PORT=0, a random unused port will be
# chosen, which is probably better.
PORT=7745

else
    # The -i switch tells dmtcp_launch the time in seconds between
    # checkpoints. Probably 60 is too small, so set it up accordingly.
    dmtcp_launch -i 60 -p ${PORT} ./my_job
fi
</file>
That's it; we will use this script to launch our job and to re-launch it if we need to. DMTCP will regularly create system dumps and link ''dmtcp_restart_script.sh'' to the latest restart script, so re-launching the job will resume from the most recent checkpoint.
=== What if I want to launch N jobs using DMTCP? Is that possible? ===
Yes. You just have to remember to assign each job a different port number and run every job from its own subdirectory to avoid any conflict.
+ | |||
+ | Consider my particular case: I need to run several R programs (hundreds). Each script will run on one node. This is the Python code that I have used to generate the bash scripts that I will send to '' | ||
+ | |||
+ | |||
<file python generatordmtcp.py>
#!/usr/bin/env python
import os

# New version of this script. Now we use DMTCP to launch
# the scripts.

def chunks(l, n):
    """ Yield successive n-sized chunks from l.
    """
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

## MAIN ##
if (__name__ == "__main__"):

    NPROCS = 12

    # Get list of scripts to run. They are files with both
    # '...' and '...' in their names, and each
    # script is one job to launch.
    scripts = os.listdir('...')
    scripts = filter(lambda x: x.find('...') != -1, scripts)
    scripts = filter(lambda x: x.find('...') != -1, scripts)
    scripts.sort()
    id = 0
    # We'll save temporary results in the projects directory, so we
    # don't have to worry about quotas on the scratch one. Might need
    # these data for several weeks.
    optdir = "/..."
    # Port list for DMTCP
    ports = range(7701, 7713)

    ## MAIN LOOP ##
    for batch in chunks(scripts, NPROCS):
        id = id + 1
        jobname = "...%d" % id
        btemp = """#!/bin/bash
#PBS -A eim-670-aa
#PBS -l nodes=1:ppn=12
#PBS -l walltime=00:...
#PBS -V
#PBS -N %s
#PBS -o %s
#PBS -e %s

function rundmtcpjob () {
    jobfile=$1
    port=$2

    jobname=$(basename ${jobfile} .R)
    optdir=/...

    # Create job directory within ${optdir} and copy all files there.
    # If it already exists, it might mean the script already ran once,
    # so don't do anything.
    scdir=${optdir}/${jobname}
    if [ ! -e ${scdir} ]
    then
        mkdir ${scdir}
        cp -va * ${scdir}
    fi

    # Move to $scdir and run the script using dmtcp_launch, as explained in the
    # workshop. Will use that directory as the temporary one.
    cd ${scdir}
    if [ -e "dmtcp_restart_script.sh" ]
    then
        ./dmtcp_restart_script.sh
    else
        dmtcp_launch -i 86400 -p ${port} R CMD BATCH ${jobfile}
    fi
}

cd /...

"""

        jobsfile = jobname + '.sh'
        f = open(jobsfile, 'w')
        f.write(btemp)
        for i in range(len(batch)):
            line = "rundmtcpjob %s %d &\n" % (batch[i], ports[i])
            f.write(line)
        f.write("wait\n")
        f.close()
        os.chmod(jobsfile, 0755)
    # end for loop
</file>
+ | |||
+ | In the end, this script generates a bunch of '' | ||
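
If you want to submit all of the generated job files in one go, a small loop does it. This is just a sketch, assuming a hypothetical ''dmtcpjobs*.sh'' naming pattern for the generated files and that ''qsub'' is the submission command on your cluster; adjust both to your setup:

<code python>
import glob
import subprocess

# Hypothetical file pattern and submission command; change them to
# match your cluster and your naming scheme.
for jobsfile in sorted(glob.glob("dmtcpjobs*.sh")):
    subprocess.call(["qsub", jobsfile])
</code>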

**Currently this is not working as expected; for some unknown reason, only two random jobs get restarted. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me).**

**Update 2: they did not reply.**