First pass at modular DataPipeline #1

Draft: wants to merge 26 commits into main
Conversation

@ethho ethho commented Aug 23, 2021

  • This PR implements a modular replacement for the alphafold.data.pipeline.DataPipeline class, contained in a new class alphafold.data.pipeline.ModularDataPipeline.
  • Instead of calling pipeline "runner" instances directly (as DataPipeline does), ModularDataPipeline calls functions in the new module alphafold.data.tools.cli, which wrap construction and execution of these "runner" instances.
  • For instance, there are functions that execute jackhmmer, hhsearch, and hhblits. All of these steps attempt to cache results to a pickle file keyed on their hashed input (kw)args, using the new alphafold.data.tools.cache_utils.cache_to_pckl decorator. This lets the user avoid repeating expensive operations such as hhblits.
  • alphafold.data.tools.cli also provides a Click CLI subcommand for each pipeline step. These subcommands are installed under the af2 console_script in setup.py. For instance, to run only the jackhmmer_uniref90 step of the pipeline, issue the following after installing the AlphaFold2 package:
af2 jackhmmer --input-fasta-path /scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta --jackhmmer-binary-path /usr/bin/jackhmmer --database-path /scratch/projects/tacc/bio/alphafold/data/uniref90/uniref90.fasta --output-dir $PWD
  • Pipeline steps run through the Click CLI share the cache with ModularDataPipeline.
  • Includes several commits merged from upstream deepmind/alphafold.
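For readers unfamiliar with how a Click group becomes an `af2` console script, here is a minimal sketch of the pattern. This is illustrative only; the option names match the example above, but the group name, echo output, and body are assumptions, not the fork's actual code (which would construct and run the Jackhmmer runner):

```python
# Sketch: a Click group exposing one pipeline step as an `af2` subcommand.
# In setup.py, an entry point like `af2 = alphafold.data.tools.cli:cli`
# under console_scripts would install this as the `af2` command.
import click


@click.group()
def cli():
    """Entry point installed as the `af2` console script."""


@cli.command()
@click.option("--input-fasta-path", required=True)
@click.option("--jackhmmer-binary-path", required=True)
@click.option("--database-path", required=True)
@click.option("--output-dir", default=".")
def jackhmmer(input_fasta_path, jackhmmer_binary_path, database_path, output_dir):
    # The real subcommand would build and execute a jackhmmer runner;
    # here we only echo the resolved arguments for illustration.
    click.echo(f"jackhmmer: {input_fasta_path} -> {output_dir}")


if __name__ == "__main__":
    cli()
```

With this shape, `af2 jackhmmer --help` lists the options, and each pipeline step gets its own subcommand under the same group.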


ethho commented Aug 23, 2021

Usage

Clone Ethan's Fork

git clone [email protected]:eho-tacc/alphafold.git af2-eho-fork
cd af2-eho-fork

Running Entire DataPipeline

The run_alphafold.py script installed in the module should still work as expected. On S2 idev:

singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif python3 run_alphafold.py \
    --flagfile=/scratch/projects/tacc/bio/alphafold/test-container/flags/reduced_dbs.ff \
    --fasta_paths=/scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta \
    --output_dir=$SCRATCH/af2_reduced \
    --model_names=model_1

Running One Step of the DataPipeline

On S2 idev:

# Install the AF2 console script
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif bash -c 'python3 -m pip install -q --user . && $HOME/.local/bin/af2 jackhmmer --help'

# Run only the jackhmmer_uniref90 step
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif $HOME/.local/bin/af2 jackhmmer --input-fasta-path /scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta --jackhmmer-binary-path /usr/bin/jackhmmer --database-path /scratch/projects/tacc/bio/alphafold/data/uniref90/uniref90.fasta --output-dir $PWD

Output Caching

Very little data (on the order of megabytes) is transferred between steps of the DataPipeline. This means we can cheaply cache the outputs of expensive steps such as hhblits, so the user can avoid re-running an alignment with the same set of inputs.

After running a step of the DataPipeline (either via run_alphafold.py or a single step via the CLI), the output is cached to a pickle (*.pckl) file in the directory defined by the environment variable AF2_CACHE_DIR ($PWD/.cache by default). Subsequent runs hash the input args/kwargs of the step, look for a corresponding pickle file in the cache dir, and return the deserialized contents of the cache instead of rerunning the step. If no match is found, the step runs as usual and its outputs are written to the cache.
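The hash-then-lookup flow described above can be sketched as a decorator. This is a simplified stand-in for alphafold.data.tools.cache_utils.cache_to_pckl, not the fork's exact implementation; the hashing scheme, file naming, and env-var handling here are assumptions:

```python
# Sketch: cache a function's return value to a pickle file keyed on the
# hash of its (kw)args, honoring AF2_CACHE_DIR and AF2_SKIP_PCKL_CACHE.
import functools
import hashlib
import os
import pickle


def cache_to_pckl(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        cache_dir = os.environ.get(
            "AF2_CACHE_DIR", os.path.join(os.getcwd(), ".cache"))
        os.makedirs(cache_dir, exist_ok=True)
        # Stable cache key: hash the function name plus its (kw)args.
        key = hashlib.sha256(
            repr((func.__name__, args, sorted(kwargs.items()))).encode()
        ).hexdigest()
        path = os.path.join(cache_dir, f"{key}.pckl")
        if os.path.exists(path) and not os.environ.get("AF2_SKIP_PCKL_CACHE"):
            # Cache hit: return the deserialized result, skip the call.
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper


@cache_to_pckl
def expensive_step(x):
    # Stands in for an expensive alignment step such as hhblits.
    return x * 2
```

The first call to `expensive_step(3)` runs the function and writes the pickle; a second call with the same arguments returns the cached result without rerunning.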

This fork verbosely logs this behavior. To change the location of the cache, set the environment variable AF2_CACHE_DIR:

export AF2_CACHE_DIR="/tmp"

To disable reading from the cache, forcing every step to rerun, set the environment variable AF2_SKIP_PCKL_CACHE:

export AF2_SKIP_PCKL_CACHE=1


ethho commented Aug 23, 2021

Testing

Coverage on Systems

  • Stampede2
  • Frontera

Expected Behavior

  • Running the entire DataPipeline using run_alphafold.py caches to $PWD/.cache by default. This behavior should be logged to stdout/stderr.
  • Running the entire DataPipeline with the same params reads from the cache instead of re-running all steps. Again, this should be evident from the logs.
  • Running individual steps of the DataPipeline also hits the cache instead of rerunning.
  • export AF2_SKIP_PCKL_CACHE=1 forces a re-run despite the cache.
  • Moving the cache or setting AF2_CACHE_DIR (e.g. export AF2_CACHE_DIR=/tmp) forces a rerun.
  • Changing input (kw)args (e.g. changing --input-fasta-path) forces a rerun.
