Differences
This shows you the differences between two versions of the page.
checkpoint_techniques_on_compute_canada_clusters [2015/03/27 21:10] – [Automatic checkpoints] 132.216.122.26 | checkpoint_techniques_on_compute_canada_clusters [2024/03/26 13:52] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | These are the notes for the Checkpoint Techniques workshop I attended on March 26th, 2015. They might be useful for people who want to learn how to implement this in their own programs. Please don't hesitate to edit this page if you feel I left something out, if you want to add something of your own, or if my English sounds funny. | + | These are the notes for the Checkpoint Techniques workshop I attended on March 26th, 2015 (the workshop materials can be found [[http:// |
===== Random stuff ===== | ===== Random stuff ===== | ||
- | * Maximum '' | + | * Maximum '' |
* There is a [[https:// | * There is a [[https:// | ||
Line 45: | Line 45: | ||
# 7779 by default, but if there are several DMTCP coordinators running on | # 7779 by default, but if there are several DMTCP coordinators running on | ||
# the same node we will have problems. The best solution is to assign the | # the same node we will have problems. The best solution is to assign the | ||
- | # port number manually. | + | # port number manually. Also, if PORT=0, a random unused port will be |
+ | # chosen, which is probably better. | ||
PORT=7745 | PORT=7745 | ||
Line 78: | Line 79: | ||
# New version of this script. Now we use DMTCP to launch | # New version of this script. Now we use DMTCP to launch | ||
- | # the scripts | + | # the scripts. |
def chunks(l, n): | def chunks(l, n): | ||
Line 110: | Line 111: | ||
id = id + 1 | id = id + 1 | ||
jobname = " | jobname = " | ||
- | gnuparcommand = " | ||
- | (len(batch), | ||
- | " | ||
- | |||
btemp = """# | btemp = """# | ||
#PBS -A eim-670-aa | #PBS -A eim-670-aa | ||
Line 152: | Line 149: | ||
cd / | cd / | ||
- | |||
- | export -f rundmtcpjob | ||
- | |||
- | %s | ||
- | |||
- | # wait # with parallel it is not necessary | ||
""" | """ | ||
| | ||
- | | + | |
- | | + | |
jobsfile = jobname + ' | jobsfile = jobname + ' | ||
f = open(jobsfile, | f = open(jobsfile, | ||
f.write(btemp) | f.write(btemp) | ||
+ | for i in range(len(batch)): | ||
+ | line = " | ||
+ | f.write(line) | ||
+ | f.write(" | ||
f.close() | f.close() | ||
os.chmod(jobsfile, | os.chmod(jobsfile, | ||
# end for loop | # end for loop | ||
- | |||
</ | </ | ||
- | This script | + | In the end, this script |
+ | |||
+ | **Currently this is not working as expected; for some unknown reason, only 2 random jobs get re-started. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)** | ||
- | In the end, this script generates a bunch of '' | + | **Update 2: they did not reply.** |
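
The batching step in the revised script (split the job list into chunks, then write one launch line per job into each batch file) can be sketched roughly as follows. This is only a minimal illustration, not the original script: the file name pattern ''batch_NNN.sh'', the ''header'' argument, and the trailing ''&'' and ''wait'' lines are assumptions, because the corresponding strings are truncated in this diff view. The PBS header and the DMTCP launch command are deliberately left out.

```python
import os
import stat
import tempfile


def chunks(items, n):
    """Yield successive slices of `items`, each holding at most n entries."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


def write_batch_files(commands, batch_size, outdir, header="#!/bin/bash\n"):
    """Write one shell script per batch and return the list of file paths.

    `commands` is a list of shell command strings (hypothetical stand-ins
    for the real launch lines, which are not visible in the diff).
    """
    paths = []
    for batch_id, batch in enumerate(chunks(commands, batch_size), start=1):
        # Hypothetical naming scheme; the original job name is truncated.
        jobsfile = os.path.join(outdir, "batch_%03d.sh" % batch_id)
        with open(jobsfile, "w") as f:
            f.write(header)
            for command in batch:
                # Assumption: each job is started in the background so the
                # jobs of one batch run concurrently.
                f.write(command + " &\n")
            # Assumption: the script blocks until every background job ends.
            f.write("wait\n")
        # Add the owner-execute bit so the generated script can be run directly.
        os.chmod(jobsfile, os.stat(jobsfile).st_mode | stat.S_IXUSR)
        paths.append(jobsfile)
    return paths


if __name__ == "__main__":
    outdir = tempfile.mkdtemp()
    for path in write_batch_files(["echo job1", "echo job2", "echo job3"], 2, outdir):
        print(path)
```

Writing the launch lines in a plain loop keeps each generated file readable and easy to debug by hand, at the cost of having to manage the background processes explicitly.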