First pass at modular DataPipeline #1

Draft: wants to merge 26 commits into main
Conversation

@ethho ethho commented Aug 23, 2021

  • This PR implements a modular replacement for the alphafold.data.pipeline.DataPipeline class, contained in a new class alphafold.data.pipeline.ModularDataPipeline.
  • Instead of calling pipeline "runner" instances directly (as DataPipeline does), ModularDataPipeline calls functions in the new module alphafold.data.tools.cli, which wrap construction and execution of these "runner" instances.
  • For instance, there are functions that execute jackhmmer, hhsearch, and hhblits. All of these steps attempt to cache results to a pickle file keyed on their hashed input (kw)args, using the new alphafold.data.tools.cache_utils.cache_to_pckl decorator. This lets the user avoid repeating expensive operations such as hhblits.
  • alphafold.data.tools.cli also provides a Click CLI subcommand for each pipeline step. These subcommands are installed under the af2 console_script in setup.py. For instance, to run only the jackhmmer_uniref90 step of the pipeline, issue the following after installing the AlphaFold2 package:
af2 jackhmmer --input-fasta-path /scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta --jackhmmer-binary-path /usr/bin/jackhmmer --database-path /scratch/projects/tacc/bio/alphafold/data/uniref90/uniref90.fasta --output-dir $PWD
  • Pipeline steps run through the Click CLI share the cache with ModularDataPipeline.
  • Includes several commits merged from upstream deepmind/alphafold.
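For readers unfamiliar with how a Click group becomes an `af2` console script, here is a minimal sketch of the pattern. This is illustrative only; the option names match the example above, but the group name, echo output, and body are assumptions, not the fork's actual code (which would construct and run the Jackhmmer runner):

```python
# Sketch: a Click group exposing one pipeline step as an `af2` subcommand.
# In setup.py, an entry point like `af2 = alphafold.data.tools.cli:cli`
# under console_scripts would install this as the `af2` command.
import click


@click.group()
def cli():
    """Entry point installed as the `af2` console script."""


@cli.command()
@click.option("--input-fasta-path", required=True)
@click.option("--jackhmmer-binary-path", required=True)
@click.option("--database-path", required=True)
@click.option("--output-dir", default=".")
def jackhmmer(input_fasta_path, jackhmmer_binary_path, database_path, output_dir):
    # The real subcommand would build and execute a jackhmmer runner;
    # here we only echo the resolved arguments for illustration.
    click.echo(f"jackhmmer: {input_fasta_path} -> {output_dir}")


if __name__ == "__main__":
    cli()
```

With this shape, `af2 jackhmmer --help` lists the options, and each pipeline step gets its own subcommand under the same group.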


ethho commented Aug 23, 2021

Usage

Clone Ethan's Fork

git clone [email protected]:eho-tacc/alphafold.git af2-eho-fork
cd af2-eho-fork

Running Entire DataPipeline

The run_alphafold.py script installed in the module should still work as expected. On S2 idev:

singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif python3 run_alphafold.py \
    --flagfile=/scratch/projects/tacc/bio/alphafold/test-container/flags/reduced_dbs.ff \
    --fasta_paths=/scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta \
    --output_dir=$SCRATCH/af2_reduced \
    --model_names=model_1

Running One Step of the DataPipeline

On S2 idev:

# Install the AF2 console script
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif bash -c 'python3 -m pip install -q --user . && $HOME/.local/bin/af2 jackhmmer --help'

# Run only the jackhmmer_uniref90 step
singularity exec /scratch/projects/tacc/bio/alphafold/images/alphafold_2.0.0.sif $HOME/.local/bin/af2 jackhmmer --input-fasta-path /scratch/projects/tacc/bio/alphafold/test-container/input/sample.fasta --jackhmmer-binary-path /usr/bin/jackhmmer --database-path /scratch/projects/tacc/bio/alphafold/data/uniref90/uniref90.fasta --output-dir $PWD

Output Caching

Very little data (on the order of megabytes) is transferred between steps of the DataPipeline. This means we can cheaply cache the outputs of expensive steps such as hhblits, so the user can avoid re-running an alignment with the same set of inputs.

After running a step of the DataPipeline (either via run_alphafold.py or a single step via the CLI), the output is cached to a pickle (*.pckl) file in the directory defined by the environment variable AF2_CACHE_DIR ($PWD/.cache by default). Subsequent runs hash the input args/kwargs of the step, look for a corresponding pickle file in the cache dir, and return the deserialized contents of the cache instead of rerunning the step. If no match is found, the step runs as usual and its outputs are written to the cache.
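The hash-then-lookup flow described above can be sketched as a decorator. This is a simplified stand-in for alphafold.data.tools.cache_utils.cache_to_pckl, not the fork's exact implementation; the hashing scheme, file naming, and env-var handling here are assumptions:

```python
# Sketch: cache a function's return value to a pickle file keyed on the
# hash of its (kw)args, honoring AF2_CACHE_DIR and AF2_SKIP_PCKL_CACHE.
import functools
import hashlib
import os
import pickle


def cache_to_pckl(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        cache_dir = os.environ.get(
            "AF2_CACHE_DIR", os.path.join(os.getcwd(), ".cache"))
        os.makedirs(cache_dir, exist_ok=True)
        # Stable cache key: hash the function name plus its (kw)args.
        key = hashlib.sha256(
            repr((func.__name__, args, sorted(kwargs.items()))).encode()
        ).hexdigest()
        path = os.path.join(cache_dir, f"{key}.pckl")
        if os.path.exists(path) and not os.environ.get("AF2_SKIP_PCKL_CACHE"):
            # Cache hit: return the deserialized result, skip the call.
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper


@cache_to_pckl
def expensive_step(x):
    # Stands in for an expensive alignment step such as hhblits.
    return x * 2
```

The first call to `expensive_step(3)` runs the function and writes the pickle; a second call with the same arguments returns the cached result without rerunning.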

This fork verbosely logs this behavior. To change the location of the cache, set the environment variable AF2_CACHE_DIR:

export AF2_CACHE_DIR="/tmp"

To disable reading from the cache, forcing every step to rerun, set the environment variable AF2_SKIP_PCKL_CACHE:

export AF2_SKIP_PCKL_CACHE=1


ethho commented Aug 23, 2021

Testing

Coverage on Systems

  • Stampede2
  • Frontera

Expected Behavior

  • Running the entire DataPipeline using run_alphafold.py caches to $PWD/.cache by default. This behavior should be logged to stdout/stderr.
  • Running the entire DataPipeline with the same params reads from the cache instead of re-running all steps. Again, this should be evident from the logs.
  • Running individual steps of the DataPipeline also hits the cache instead of rerunning.
  • export AF2_SKIP_PCKL_CACHE=1 forces a re-run despite the cache.
  • Moving the cache or setting AF2_CACHE_DIR (e.g. export AF2_CACHE_DIR=/tmp) forces a rerun.
  • Changing input (kw)args (e.g. changing --input-fasta-path) forces a rerun.
