A multithreaded package to validate, curate, and transform large heterogeneous datasets using reproducible recipes, which can be written either in human-readable TOML or in Julia.
A key aim of this package is that any researcher can read and write recipes without needing to write code, making data sharing and validation faster, more accurate, and reproducible.
DataCurator is a Swiss army knife that ensures:
- pipelines can focus on the algorithm/problem solving
- human-readable "recipes" for future reproducibility
- validation of huge datasets at high speed
- out-of-the-box operation without the need for code or dependencies
DataCurator requires a command-line interface and is supported on Linux, Windows Subsystem for Linux (WSL2), and macOS. See Quickstart and Installation for details.
- Quickstart via Singularity
- Status
- Documentation (including installation)
- What to Find Where
- Preprint/Cite
- Troubleshooting
The recommended way to use DataCurator is via the Singularity container. Note that this is only supported on Linux, Windows Subsystem for Linux (WSL2), and macOS (x86). For ARM-based Macs (e.g. from early 2021 onward), use the Docker container or the source code; see Installation for details.
wget https://github.com/apptainer/singularity/releases/download/v3.8.7/singularity-container_3.8.7_amd64.deb
sudo apt-get install ./singularity-container_3.8.7_amd64.deb
Please refer to the Singularity docs.
After installation, test it by running `singularity --version` in a terminal. This should print `singularity version 3.8.7`.
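The version check can also be scripted, which helps when setting up several machines. The following is a minimal sketch; the helper name `singularity_version_of` is hypothetical and not part of DataCurator or Singularity, and it assumes the version is the last word of the `singularity --version` output, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the trailing version number from a string
# such as "singularity version 3.8.7".
singularity_version_of() {
    # Strip everything up to and including the last space.
    printf '%s\n' "${1##* }"
}

# Report the installed version, or warn if Singularity is missing.
if command -v singularity >/dev/null 2>&1; then
    echo "Found Singularity $(singularity_version_of "$(singularity --version)")"
else
    echo "singularity is not on PATH; install it before continuing" >&2
fi
```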
singularity pull datacurator.sif library://bcvcsert/datacurator/datacurator:latest
The container image can also be found at Sylabs.
chmod u+x ./datacurator.sif
Depending on the directory you're in, you may need to grant Singularity read/write access. By default, Singularity has read/write access to $HOME only.
export SINGULARITY_BINDPATH=${PWD}
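If your data lives in more than one location outside $HOME, `SINGULARITY_BINDPATH` accepts a comma-separated list of directories. A small sketch follows; `EXTRA_DATA_DIR` is a hypothetical variable used only for illustration, and `/tmp` is a placeholder for your own data directory.

```shell
# Hypothetical second directory that the container should be able to access;
# replace /tmp with the path to your own data.
EXTRA_DATA_DIR="/tmp"

# Bind both the current directory and the extra directory (comma-separated).
export SINGULARITY_BINDPATH="${PWD},${EXTRA_DATA_DIR}"
echo "Singularity will bind: ${SINGULARITY_BINDPATH}"
```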
wget https://raw.githubusercontent.com/bencardoen/DataCurator.jl/main/example_recipes/count.toml
mkdir testdir
touch testdir/text.txt
./datacurator.sif -r count.toml
That should show output similar to
The recipe used can be found here.
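The test steps above can be bundled into one re-runnable script. This is only a sketch: the function names `prepare_testdir` and `fetch_recipe` are hypothetical, and it assumes `datacurator.sif` has already been pulled into the current directory and made executable.

```shell
#!/usr/bin/env bash
# Sketch: bundle the quickstart test steps into one re-runnable script.
# Assumes datacurator.sif is already in the current directory.

# Create the toy input directory used by the count recipe (safe to re-run).
prepare_testdir() {
    mkdir -p testdir
    touch testdir/text.txt
}

# Download the example recipe only if it is not already present.
fetch_recipe() {
    [ -f count.toml ] || wget https://raw.githubusercontent.com/bencardoen/DataCurator.jl/main/example_recipes/count.toml
}

if [ -x ./datacurator.sif ]; then
    # Give the container read/write access to the current directory.
    export SINGULARITY_BINDPATH="${PWD}"
    prepare_testdir
    fetch_recipe
    ./datacurator.sif -r count.toml
else
    echo "datacurator.sif not found or not executable; pull it and run chmod u+x first" >&2
fi
```

Guarding on the image being present keeps the script from downloading files or creating directories when the container is not yet installed.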
What next? Check out two simple examples of use cases and TOML recipes, then continue with the large collection of well-commented example recipes or the complete walkthrough of DataCurator. Please see the documentation.
The outcome of automated tests (including building on macOS and a Debian Docker image):
Code coverage (which parts of the source code are tested):
For full documentation, click here. This includes more detailed installation docs, two simple examples of use cases and TOML recipes, well-commented example recipes, a complete walkthrough of DataCurator, and more.
repository
├── example_recipes ## Start here for easy to copy example recipes
├── docs
│ ├── src ## Documentation in markdown format (viewable online as well)
│ │ ├── make.jl ## `cd docs && julia --project=.. make.jl` to rebuild docs
├── singularity ## Singularity image instructions
├── src ## source code of the package itself
├── scripts ## Utility scripts to run DC, generate test data, ...
├── test ## test suite and related files
├── runjulia.sh ## Required for Singularity image
└── buildimage.sh ## Rebuilds singularity image for you (Needs root !!)
You can find our preprint here.
If you run into a problem, please search the existing issues to see whether it has been encountered before. If not, please create a new issue and follow the templates for bug reports and/or feature requests.
If you have a workflow that DataCurator does not currently support, or does not support the way you'd like, you can mention this too. In that case, please share a minimal example of your data so that we can add a new test case once the feature is complete.
DataCurator relies heavily on existing Julia packages for specialized functionality: