[DRAFT] Database cleanup pipeline #964

Draft: wants to merge 20 commits into main

nwillhoft (Contributor) commented Oct 10, 2024

[DRAFT - Nextflow pipeline to clean up databases].

JIRA ticket: https://www.ebi.ac.uk/panda/jira/browse/ENSCORESW-4404

Description

A Nextflow pipeline that exports a database's SQL to file and stores it. The source database can optionally be dropped afterwards to free up storage space.

Use case

If a db host contains old or unused dbs, this pipeline can be used to dump their SQL, put the files in a convenient place and remove the dbs.
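
The dump-then-remove order described above matters: the source should only be dropped once a usable dump exists. The following is a minimal sketch of that safety check, not the pipeline's actual implementation. It assumes MySQL client tools (`mysqldump`, `mysql`) with credentials supplied by the client config; the function names and all host, user, db and directory values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only, NOT the pipeline's implementation.
# ASSUMPTIONS: MySQL client tools on PATH, credentials in the client config,
# and all host/user/db/directory values are hypothetical placeholders.

# Succeeds only when dropping was requested AND the dump file is non-empty.
safe_to_drop() {
    local dump_file="$1" drop_source_db="$2"
    [ "$drop_source_db" = "true" ] && [ -s "$dump_file" ]
}

# Dump one database to <out_dir>/<db>.sql, then drop it if it is safe to do so.
dump_and_cleanup() {
    local host="$1" user="$2" db="$3" out_dir="$4" drop_source_db="$5"
    local dump_file="${out_dir}/${db}.sql"

    # Stop here if the dump fails, so the source is never dropped without a backup.
    mysqldump --host="$host" --user="$user" --single-transaction \
        --result-file="$dump_file" "$db" || return 1

    if safe_to_drop "$dump_file" "$drop_source_db"; then
        mysql --host="$host" --user="$user" --execute="DROP DATABASE \`${db}\`"
    fi
}
```

With `drop_source_db` passed as `false`, `dump_and_cleanup` only writes the dump file, which mirrors the advice to test without dropping first.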

Example of how to run the pipeline:

# set up environment
module load nextflow
salloc -t 04:00:00 --mem=8G -p debug
export NEXTFLOW_DIR=/hps/software/users/ensembl/infrastructure/nwillhoft/ensembl-production/
export DATA_DIR=/hps/nobackup/flicek/ensembl/infrastructure/nwillhoft/
cd $DATA_DIR/
source /hps/software/users/ensembl/ensw/swenv/initenv default
pyenv activate production-tools

# help message
nextflow run $NEXTFLOW_DIR/nextflow/workflows/db_cleanup/main.nf --help

# set up your config file, update email address on command line and run
# NB. please test first with `drop_source_db` set to false in config (and when happy feel free to change to true)
# NB. please see note below and test on a single db to start with
nextflow run $NEXTFLOW_DIR/nextflow/workflows/db_cleanup/main.nf -N <email>@ebi.ac.uk

Benefits

This pipeline will make it easier to automate the removal of old dbs.

Possible Drawbacks

Not an intended drawback, but running the pipeline on more than 1 db at a time appears to cause a bottleneck in the dbcopy-client processing. This needs further testing, as the pipeline is set up to process everything in parallel to be as efficient as possible. For example, copying 3 dbs from st6 to core-prod-1 took over 24 hours for the copy step alone, whereas copying 1 db at a time typically takes around an hour or less for this step.
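
Until the bottleneck is understood, one possible workaround is to launch the pipeline once per db, in sequence. The sketch below illustrates that idea only. `--db_list` is a hypothetical parameter name (the real one should be taken from the pipeline's `--help` output or config file), and `run_sequentially` and `DRY_RUN` are names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: process databases one at a time instead of in parallel.
# ASSUMPTION: `--db_list` is a hypothetical pipeline parameter; check the
# pipeline's --help output for the real name before using this.

run_sequentially() {
    local main_nf="$1"
    shift
    local db
    for db in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command instead of running it.
            echo "nextflow run ${main_nf} --db_list ${db}"
        else
            # Stop at the first failure so later dbs are left untouched.
            nextflow run "${main_nf}" --db_list "${db}" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 run_sequentially "$NEXTFLOW_DIR/nextflow/workflows/db_cleanup/main.nf" db_a db_b` prints the two commands without executing them, which is a cheap way to check the db list before a real run.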

Testing

  • Have you added/modified unit tests to test the changes? So far the only tests use nf-schema to validate parameters.
  • If so, do the tests pass? N/A
  • Have you run the entire test suite and no regression was detected? No
  • TravisCI passed on your branch. Partially: the Python 3.7 build passes; the Python 3.8 and Perl builds are erroring. The Perl 5.14 build seems to be erroring due to Perl module installation issues.

Dependencies

If applicable, define what code dependencies were added and/or updated.

The only external code dependency is the nf-schema plugin (plugin/nf-schema) used within Nextflow.
