Does Martian allow caching of runs into a db? #54

Open
adarshdas1995 opened this issue Apr 23, 2021 · 3 comments

@adarshdas1995

Hello,
I am wondering if Martian allows caching of runs using a database, similar to how Cromwell does it. I could not find much in the docs.

@adam-azarchs
Member

Could you please clarify what it is you're asking about? I'm not sure what you mean by "caching" in this context. Cromwell has a rather different scope from martian; martian doesn't have a "server" process, for example.

Martian stores all of the state of a run in the filesystem. It can be made to put the metadata (other than the top-level metadata files) in a single _metadata.zip file at the end of the run with the --zip flag. But if you want the metadata in a database you'd need to have a separate process to poll it; martian was specifically designed for running in situations where a traditional database couldn't be used.

@likhitha-surapaneni

Hi, I am trying to understand the subsection "Restarting" pipelines in the documentation (https://martian-lang.org/running-pipelines/). It is stated that "Stages that have already completed successfully will not be reset or re-run". I have the following questions:

  1. How does mrp keep track of the completed stages?
  2. When user A reruns a pipestance originally run by user B from the same directory, does mrp start the pipestance from the beginning, or does it pick up after the stages that already completed successfully?
  3. How can user A utilise the completed stages run by user B when running from different directories?

Thank you

@adam-azarchs
Member

To answer the first question, the state of each stage is represented by the files in each job directory, e.g. SUM_SQUARE_PIPELINE/SUM_SQUARES/fork0/chnk1. For example, a job which has started will have a _log file, and a completed job will have a _complete file.
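
As a rough illustration of the file-based state described above, the following sketch walks a pipestance directory and classifies each job directory by which marker files it contains. It relies only on the two file names mentioned here (_log and _complete); the function name job_states and the state labels are made up for this example, and it is not how mrp itself is implemented.

```python
import os

def job_states(pipestance_dir):
    """Map each directory holding a _log or _complete file to a coarse state.

    Only the two marker files named in this thread are inspected; any other
    metadata files are ignored by this sketch.
    """
    states = {}
    for dirpath, _dirnames, filenames in os.walk(pipestance_dir):
        if "_complete" in filenames:
            # Check _complete first, since a finished job can also have a _log.
            states[dirpath] = "complete"
        elif "_log" in filenames:
            states[dirpath] = "started"
    return states
```

Calling job_states on a pipestance directory returns a dictionary keyed by job directory path, so counting the "complete" entries gives a quick view of how far a run got.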

If another user restarts the pipestance in the same directory, the behavior will be the same as if the original user had restarted it, so long as the file permissions allow that user write access to the pipestance directory. However, the umask value configured on most systems will mean that the pipestance directory will not be writeable for other users.

I believe that if another user copies the pipestance directory over to a location they control (using cp -a, never cp -r, to ensure symlinks are preserved), they should be able to restart the pipestance in that directory and have everything work. You could actually use rsync -a --exclude="*/files/*" old_pipestance_directory/ new_pipestance_directory to avoid copying intermediate stage output files, since the copied metadata files refer to those files in their original locations by absolute path. Note that rsync really cares about whether there's a trailing / on the source and destination, so be careful there.
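
For anyone who would rather script the copy, here is a minimal Python sketch of the same idea: copy the tree, keep symlinks as symlinks, and skip the contents of any nested files directory. It uses shutil.copytree from the standard library; the function name copy_pipestance_metadata is hypothetical, and unlike cp -a or rsync -a it does not preserve file ownership. I have not tested restarting a pipestance copied this way.

```python
import os
import shutil

def copy_pipestance_metadata(src, dst):
    """Copy src to dst, preserving symlinks and skipping files/ contents.

    dst must not already exist (shutil.copytree's default behavior).
    """
    src = os.path.abspath(src)

    def skip_files_contents(directory, names):
        # Roughly analogous to rsync --exclude="*/files/*": the files
        # directory itself is kept, but nothing inside it is copied.
        if os.path.basename(directory) == "files" and os.path.abspath(directory) != src:
            return names
        return []

    # symlinks=True copies links as links instead of following them.
    shutil.copytree(src, dst, symlinks=True, ignore=skip_files_contents)
```

Because the skipped outputs are still referenced by absolute path in the copied metadata, the original pipestance directory has to stay readable for the new user.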

Note that because the pipestance metadata may contain paths referring to files in the old pipestance directory, if the first user restarts their pipestance some of those files may get deleted by VDR, so I would not recommend copying a pipestance this way if they might do that or if they might delete the pipestance. To protect yourself from that you'd need to not exclude the /files subdirectories and also use e.g. sed to update the paths in all _args and _outs files in the pipestance tree.
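
The path update mentioned above could be sketched in Python instead of sed roughly as follows. This is a plain text substitution over every _args and _outs file, on the assumption that the old directory appears literally in those files; the function name rewrite_paths and its parameters are hypothetical, and it should be tried on a copy first.

```python
import os

def rewrite_paths(pipestance_dir, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in all _args and _outs files.

    Returns the list of files that were actually modified.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(pipestance_dir):
        for name in ("_args", "_outs"):
            if name not in filenames:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            # Literal substring replacement, like sed with a fixed string.
            updated = text.replace(old_prefix, new_prefix)
            if updated != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(updated)
                changed.append(path)
    return changed
```

Passing both prefixes with a trailing / reduces the chance of rewriting an unrelated path that merely starts with the same characters.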
