Differences
This shows you the differences between two versions of the page.
checkpoint_techniques_on_compute_canada_clusters [2015/03/30 19:29] – [Automatic checkpoints] 132.216.122.26 | checkpoint_techniques_on_compute_canada_clusters [2024/03/26 13:52] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 45: | Line 45: | ||
# 7779 by default, but if there are several DMTCP schedulers running on | # 7779 by default, but if there are several DMTCP schedulers running on | ||
# the same node we will have problems. The best solution is to assign the | # the same node we will have problems. The best solution is to assign the | ||
- | # port number manually. | + | # port number manually. Also, if PORT=0, a random unused port will be |
+ | # chosen, which is probably better. | ||
PORT=7745 | PORT=7745 | ||
Line 78: | Line 79: | ||
# New version of this script. Now we use DMTCP to launch | # New version of this script. Now we use DMTCP to launch | ||
- | # the scripts | + | # the scripts. |
def chunks(l, n): | def chunks(l, n): | ||
Line 167: | Line 168: | ||
In the end, this script generates a bunch of '' | In the end, this script generates a bunch of '' | ||
- | **Currently this is not working as expected. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)** | + | **Currently this is not working as expected; for some unknown reason, only 2 random jobs get re-started. I have contacted Calcul Québec about this and they should reply shortly. I will update this page with a bug-free script (or whatever solution they give me.)** |
+ | |||
+ | **Update 2: they did not reply.** |
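
The revised comment at line 45 notes that with `PORT=0` a random unused port will be chosen. As a rough illustration of the general operating-system mechanism behind that behaviour (not DMTCP's own code), the Python sketch below binds a TCP socket to port 0 and reads back the port the kernel assigned. The helper name `pick_unused_port` and the loopback host are assumptions made for this example only.

```python
import socket


def pick_unused_port(host="127.0.0.1"):
    """Return a TCP port that was unused at the moment of the call.

    Binding to port 0 asks the kernel to pick any free port; the
    chosen number is then read back with getsockname().
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))           # port 0 = "let the OS choose"
        return sock.getsockname()[1]   # (host, port) -> port


if __name__ == "__main__":
    # Hypothetical usage: print a candidate port for a job script.
    print(pick_unused_port())
```

A port found this way can be taken by another process between the moment the socket is closed and the moment the port is reused, so passing 0 directly, as the comment suggests, is likely the safer choice when several jobs share a node.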