Handling of requeueing in SLURM #201

dthulke · 2024-07-22T16:41:19Z

SLURM can automatically requeue jobs, e.g. on node failure or preemption by a higher priority job (https://slurm.schedmd.com/sbatch.html#OPT_requeue). In general this is similar to the resume function we have in sisyphus, with the added bonus that jobs keep their priority.

If this is enabled (if you don't specify the flag in sbatch, the default is defined by slurm.conf), it causes a few issues:

As the job id does not change, the log file of the previous run is overwritten (this is what prompted me to look into this)
- The nicest option would be to create separate files under engine/ for each run (that is the behaviour without requeue, since the SLURM job id changes). But afaik this is not possible, as the restart number is not available in the filename pattern: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
- Set --open-mode=append (https://slurm.schedmd.com/sbatch.html#OPT_open-mode) so that the previous log is kept in the same file <-- my preferred solution
Non-resumable tasks are resumed
- This would be easy to fix by always setting --no-requeue (https://slurm.schedmd.com/sbatch.html#OPT_no-requeue) for non-resumable tasks. But this would require passing the information whether a task is resumable to the submit call function
  
  sisyphus/sisyphus/engine.py
  
  Line 36 in a22e923
  
  def submit_call(self, call, logpath, rqmt, name, task_name, task_ids):
  
  which would also potentially break custom engine implementations (but should be an easy fix, and I only know of a single custom engine implementation, by @Zettelkasten). <-- my preferred solution

Alternatively, both issues would be fixed by always setting --no-requeue, but then we would lose the advantages for resumable jobs.
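To make the two preferred options concrete, here is a minimal sketch of how the extra sbatch flags could be chosen. The helper name `build_requeue_flags` and its `resumable` parameter are hypothetical and not part of sisyphus; only the `--open-mode=append` and `--no-requeue` flags come from the sbatch documentation linked above.

```python
from typing import List


def build_requeue_flags(resumable: bool) -> List[str]:
    """Return extra sbatch flags for a task (hypothetical helper, not sisyphus API)."""
    # Append to the existing log so a requeued run does not overwrite
    # the output of the previous run.
    flags = ["--open-mode=append"]
    if not resumable:
        # Non-resumable tasks must never be restarted by SLURM.
        flags.append("--no-requeue")
    # For resumable tasks no requeue flag is added, so the default
    # from slurm.conf applies.
    return flags


print(build_requeue_flags(resumable=False))
```

How the `resumable` information reaches the engine is the open question here; the sketch only shows the flag selection once it is available.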

Are there any other opinions? If not, I'd create a PR for the two fixes.

JackTemaki · 2024-07-23T07:59:25Z

Your proposed options sound valid to me. For the log file I see no issues at all; the second one may need an additional look but should also be fine.

critias · 2024-07-23T08:25:35Z

The local engine already appends its log to the last log file. I think it's a good idea to have a clearly visible separation between different entries, similar to this:

sisyphus/sisyphus/worker.py

Line 206 in a22e923

    
           logfile.write("\n" + ("#" * 80) + "\nRETRY OR CONTINUE TASK\n" + ("#" * 80) + "\n\n")

Besides that, appending to the existing log file sounds good to me.
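As a rough sketch of what such a separator could look like for requeued SLURM jobs: sbatch sets the `SLURM_RESTART_COUNT` environment variable for restarted or requeued jobs, so the banner could include it. The function name `log_separator` and the exact banner text are assumptions for illustration, not existing sisyphus code.

```python
import os
from typing import Mapping, Optional


def log_separator(env: Optional[Mapping[str, str]] = None, width: int = 80) -> str:
    """Build a visible separator between runs in an appended log (hypothetical helper)."""
    env = os.environ if env is None else env
    title = "RETRY OR CONTINUE TASK"
    # SLURM_RESTART_COUNT is only present if the job was restarted or requeued.
    restart_count = env.get("SLURM_RESTART_COUNT")
    if restart_count is not None:
        title += f" (SLURM restart {restart_count})"
    bar = "#" * width
    return "\n" + bar + "\n" + title + "\n" + bar + "\n\n"


print(log_separator({"SLURM_RESTART_COUNT": "2"}, width=40))
```

Passing the environment as a parameter keeps the helper easy to test outside of a SLURM job.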
