title | tags | authors | affiliations | date | bibliography | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The drake R package: a pipeline toolkit for reproducibility and high-performance computing |
|
|
|
4 January 2018 |
paper.bib |
The drake R package [@drake] is a workflow manager and computational engine for data science projects. Its primary objective is to keep results up to date with the underlying code and data. When it runs a project, drake detects any pre-existing output and refreshes the pieces that are outdated or missing. Not every runthrough starts from scratch, and the final answers are reproducible. With a user-friendly R-focused interface, comprehensive documentation, and extensive implicit parallel computing support, drake surpasses the analogous functionality in similar tools such as Make [@Make], remake [@remake], memoise [@memoise], and knitr [@knitr].
In reproducible research, drake's role is to provide tangible evidence that a project's results are re-creatable. drake quickly detects when the code, data, and output are synchronized. In other words, drake helps determine if the starting materials would produce the expected output if the project were to start over and run from scratch. This approach decreases the time and effort it takes to evaluate research projects for reproducibility.
Regarding high-performance computing, drake interfaces with a variety of technologies and scheduling algorithms to deploy the steps of a data analysis project. Here, the parallel computing is implicit. In other words, drake constructs the directed acyclic network of the workflow and determines which steps can run simultaneously and which need to wait for dependencies. This automation eases the cognitive and computational burdens on the user, enhancing the readability of code and thus reproducibility.