Handling of requeueing in SLURM #201

Open
dthulke opened this issue Jul 22, 2024 · 2 comments

Comments

@dthulke
Member

dthulke commented Jul 22, 2024

SLURM can automatically requeue jobs (e.g. on node failure or on preemption by a higher-priority job: https://slurm.schedmd.com/sbatch.html#OPT_requeue). In general this is similar to the resume function we have in sisyphus, with the added bonus that jobs keep their priority.

If requeueing is enabled (when the flag is not passed to sbatch, the default comes from slurm.conf), it causes a few issues:

  1. As the job id does not change, the log file of the previous run is overwritten (this is what prompted me to look into it)
  2. Non-resumable tasks are resumed
    • This would be easy to fix by always setting --no-requeue (https://slurm.schedmd.com/sbatch.html#OPT_no-requeue) for non-resumable tasks. However, this would require passing the information on whether a task is resumable to the submit call function
      def submit_call(self, call, logpath, rqmt, name, task_name, task_ids):
      which could also break custom engine implementations (but should be an easy fix, and I only know of a single custom engine implementation, by @Zettelkasten). <-- my preferred solution
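
A minimal sketch of the preferred option, assuming a hypothetical helper `build_sbatch_args` and a hypothetical `resumable` parameter (neither is part of the current sisyphus API). Only `--no-requeue` is taken from the sbatch documentation linked above. The real engine builds its sbatch call differently.

```python
from typing import List


def build_sbatch_args(name: str, logpath: str, resumable: bool) -> List[str]:
    """Hypothetical helper: assemble the sbatch command line for one task."""
    args = ["sbatch", "-J", name, "-o", logpath]
    if not resumable:
        # A task that cannot be resumed must not be restarted by SLURM.
        args.append("--no-requeue")
    # Resumable tasks get no flag, so the slurm.conf default still applies.
    return args
```

With this shape, `submit_call` would only need one extra argument, and custom engines could ignore it until they are updated.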

Alternatively, both issues would be fixed by always setting --no-requeue, but then we would lose the advantages for resumable jobs.

Are there any other opinions? If not, I'd create a PR for the two fixes.

@JackTemaki
Contributor

Your proposed options sound valid to me. For the log file I see no issues at all; the second one may need an additional look but should also be fine.

@critias
Contributor

critias commented Jul 23, 2024

The local engine already appends its log to the last log file. I think it's a good idea to have a clearly visible separation between different entries, similar to this:

logfile.write("\n" + ("#" * 80) + "\nRETRY OR CONTINUE TASK\n" + ("#" * 80) + "\n\n")

Besides that, appending to the existing log file sounds good to me.
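
As a rough sketch of how the append-with-separator idea could look, assuming a hypothetical helper `open_log_for_append` (this is not the actual local engine code). It opens the log in append mode and writes the separator only when a previous run already left output:

```python
import os

# Same separator text as in the snippet above.
SEPARATOR = "\n" + ("#" * 80) + "\nRETRY OR CONTINUE TASK\n" + ("#" * 80) + "\n\n"


def open_log_for_append(path: str):
    """Hypothetical helper: open a log file without discarding earlier runs."""
    has_previous_run = os.path.exists(path) and os.path.getsize(path) > 0
    logfile = open(path, "a")
    if has_previous_run:
        # Mark the boundary between the earlier run and this one.
        logfile.write(SEPARATOR)
    return logfile
```

The size check keeps the first run's log free of a leading separator.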
