Slurm: WORKDIR files overwritten on multistep-stage specs #311

Open
Sinclert opened this issue Apr 22, 2021 · 5 comments
@Sinclert
Member

This issue describes an undesirable behaviour found within the SlurmJobManagerCERN class, discovered by Carl Evans (NYU HPC) and myself (NYU CDS).

Context

We are currently trying to run a complex workflow (see madminer-workflow for reference) on REANA 0.7.3, using SLURM as the computational backend. The workflow specification is written in Yadage, and it is totally functional on REANA 0.7.1, when using Kubernetes as the computational backend.

Problem

The problem appears in any Yadage spec using the multistep-stage scheduler_type value (where multiple "step-jobs" run in parallel), whenever those "step-jobs" depend on scattered files to perform their computations.

In those scenarios, the SlurmJobManagerCERN._download_dir function, in addition to being somewhat inefficient (it crawls through every file and directory in the Slurm workdir, making each step scan everything that all previous steps created), overwrites the whole workflow WORKDIR at the start of each "step-job".
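For illustration, here is a rough sketch of what such a whole-WORKDIR recursive download looks like (a hypothetical reconstruction assuming a paramiko-style SFTP client, not the actual reana-job-controller code):

```python
# Hypothetical sketch of a recursive whole-WORKDIR download, assuming a
# paramiko-style SFTP client. This is NOT the actual SlurmJobManagerCERN
# code; it only illustrates why every step-job re-fetches (and overwrites)
# files produced by earlier and parallel steps.
import os
import stat

import paramiko


def download_dir(sftp: paramiko.SFTPClient, remote_dir: str, local_dir: str) -> None:
    """Recursively copy remote_dir from the Slurm side into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    for entry in sftp.listdir_attr(remote_dir):
        remote_path = f"{remote_dir}/{entry.filename}"
        local_path = os.path.join(local_dir, entry.filename)
        if stat.S_ISDIR(entry.st_mode):
            download_dir(sftp, remote_path, local_path)
        else:
            # Unconditional get(): any existing local file is overwritten,
            # including outputs that other step-jobs just produced.
            sftp.get(remote_path, local_path)
```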

We have recently raised concerns about this behaviour on the REANA Mattermost channel (precisely here), where we thought the problem was due to the publisher_type within the Yadage specification. It turns out that was not the case; instead it is due to the multistep-stage scheduler_type value.

Testing

We did some preliminary testing to properly identify the scope of the issue.

We are fairly sure the issue is located within the SlurmJobManagerCERN._download_dir function, as we have performed some testing on a custom reana-job-controller Docker image (where we tuned this function and hardcoded some paths to our needs), and we were able to run the complete workflow successfully.

Possible solution

We believe a good patch would involve reducing the scope of the SlurmJobManagerCERN._download_dir WORKDIR copying procedure from the "workflow" level to the "step-job" level. That way, there would not be any overwriting problems among parallel "step-jobs" within the same workflow stage.
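Very roughly, the idea would be something like the following (a hypothetical sketch reusing the download_dir helper from the sketch above; the per-step subdirectory name is an assumption on our side):

```python
# Hypothetical sketch of the proposed scope reduction: copy back only the
# current step-job's subdirectory, so parallel step-jobs never touch each
# other's files. "job_subdir" is an assumed name for the per-step folder.
import os


def download_step_dir(sftp, workflow_workdir: str, local_workdir: str, job_subdir: str) -> None:
    remote_step_dir = f"{workflow_workdir}/{job_subdir}"
    local_step_dir = os.path.join(local_workdir, job_subdir)
    download_dir(sftp, remote_step_dir, local_step_dir)  # helper from the sketch above
```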

Additional clarifications

This issue has not been detected in any of the workflows you use for testing, because none of them uses the multistep-stage scheduler_type in stages that involve files. See:

@lukasheinrich offered to create a dummy workflow to test this behaviour, but no progress has been made so far (message).

@cranmer

cranmer commented Apr 28, 2021

Hello, if I may add: this is quite time-sensitive for the SCAILFIN project, as it is tied to the scalability tests that we promised as a deliverable for the NSF grant. That grant ends at the end of this summer, so we were hoping to do the tests this spring/early summer. @tiborsimko

@Sinclert
Member Author

Hey there 👋🏻

With the aim of speeding things up a bit (and given that I got no response from Lukas), I created a minimal example workflow to debug the described problem. Check it out at Scailfin/reana-slurm-test.

Within the repo, you can find instructions on how to run the workflow on Kubernetes and Slurm. Once you do, you will discover that Slurm runs always crash with Bravado HTTP errors, which are misleading, as they hide the real problem described above.

@Sinclert Sinclert changed the title Slurm: WORKDIR files overwriting on multistep-stage specs Slurm: WORKDIR files overwritten on multistep-stage specs May 10, 2021
@irinaespejo

irinaespejo commented Jun 8, 2021

Hi!

I am commenting on this issue because of two things:

  1. @tiborsimko, could you confirm/comment on whether the REANA Developer Team would be able to solve the issue? The issue is time-sensitive for us and high priority. Thank you. Also, Sinclert mentioned that this commit is important for the issue, so maybe @roksys can shed some light. Thanks!

  2. Based on what @Sinclert said in the opening message of this issue:

The workflow specification is written in Yadage, and it is totally functional on REANA 0.7.1, when using Kubernetes as the computational backend.

I have tried to submit the madminer-workflow with Kubernetes as backend and REANA version 0.7.1, and the workflow fails at the multistep-stage step. Does version 0.7.1 make reference to this line?
The same madminer-workflow has been successfully deployed at BNL and NYU with Kubernetes as backend.

Here I post a screenshot of the failing run:

[screenshot: REANA web interface]

@roksys
Contributor

roksys commented Jun 18, 2021

Hey @irinaespejo, I no longer work for CERN/REANA, so I won't be able to provide much help, but I think that using rsync instead of sftp within the _download_dir method would solve the issue.
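Something along these lines, just as a sketch (host and paths are placeholders):

```python
# Hypothetical sketch of the rsync idea: a single rsync-over-SSH call
# instead of a file-by-file SFTP crawl. Host and paths are placeholders.
import subprocess


def rsync_download(remote_host: str, remote_dir: str, local_dir: str) -> None:
    subprocess.run(
        ["rsync", "-az", f"{remote_host}:{remote_dir}/", f"{local_dir}/"],
        check=True,
    )
```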

@Sinclert
Member Author

Hi @roksys ,

I am unsure whether that alone would solve the problem. Replacing sftp with rsync without reducing the scope of the command (from the workflow-level folder to the step-level folder) could still run into race-condition issues.

If I am not mistaken, it would be like running rsync -r <src_dir> <dst_dir> at the same time (at the start of every parallel job), with exactly the same arguments... I think this StackExchange question highlights the problem with the approach you are proposing.
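To make the concern concrete, every parallel job would effectively be doing something like this (a toy sketch with placeholder paths, not actual reana-job-controller code):

```python
# Toy illustration of the race concern: several parallel step-jobs each
# launch the same whole-WORKDIR rsync at roughly the same time, all
# writing into the same destination tree. Paths are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SRC = "slurm-host:/path/to/workflow/workdir/"  # placeholder
DST = "/path/to/workflow/workdir/"             # placeholder


def sync_whole_workdir(job_id: int) -> None:
    # Every job copies *everything*, not just its own step directory.
    subprocess.run(["rsync", "-r", SRC, DST])


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(sync_whole_workdir, range(4)))
```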
