Pipelines in this repository are composed with Snakemake v5.23.0. Compatibility with different versions is not guaranteed.
Clone this repository with `git clone https://github.com/van-der-knaap-lab/snakemato.git`. On GACRC's Sapelo2, you will need to load the Git module first with `ml load git/2.23.0-GCCcore-8.3.0-nodocs`.
You will first need a conda environment with Snakemake installed. To create one on Sapelo2:

```bash
module load Miniconda3/4.7.10
conda create --name <myenv>
source activate <myenv>
conda install -c conda-forge -c bioconda snakemake
```
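Since the pipelines target Snakemake v5.23.0, you may want to pin that exact version when installing (assuming it is still available on the bioconda channel):

```bash
conda install -c conda-forge -c bioconda snakemake=5.23.0
```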
If an appropriate conda environment already exists, just run:

```bash
module load Miniconda3/4.7.10
source activate <myenv>
```
The `Snakefile.variants` pipeline involves the following steps (a simplified sketch of one rule follows the list):
- download fastq files
- merge fastq files for each accession
- trim fastq files with Trimmomatic
- align reads to a reference genome with SpeedSeq
- sort the alignments with SAMtools
- create read groups with Picard
- mark duplicates with Picard
- call variants with LUMPY
- call variants with GATK HaplotypeCaller
- combine variants returned from HaplotypeCaller
- joint genotype then select and filter variants with GATK
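To give a flavor of how these steps are expressed, here is a simplified, hypothetical sketch of the trimming rule; the actual rule names, file paths, and Trimmomatic settings in `Snakefile.variants` may differ:

```
# Hypothetical sketch only; actual names, paths, and settings may differ.
rule trim:
    input:
        r1 = "merged/{accession}_1.fastq.gz",
        r2 = "merged/{accession}_2.fastq.gz"
    output:
        r1 = "trimmed/{accession}_1.paired.fastq.gz",
        r1_unpaired = "trimmed/{accession}_1.unpaired.fastq.gz",
        r2 = "trimmed/{accession}_2.paired.fastq.gz",
        r2_unpaired = "trimmed/{accession}_2.unpaired.fastq.gz"
    threads: 4
    shell:
        "trimmomatic PE -threads {threads} {input.r1} {input.r2} "
        "{output.r1} {output.r1_unpaired} {output.r2} {output.r2_unpaired} "
        "SLIDINGWINDOW:4:20 MINLEN:50"
```

Each rule's outputs feed the next rule's inputs, so Snakemake can infer the execution order and parallelize across accessions.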
On GACRC's Sapelo2, pipelines must run in the scratch filesystem (`/scratch/<user>`), otherwise filesystem quotas will quickly be exceeded.
This pipeline accepts a config file (e.g., `config.json`), which must be located in the working directory. The config file should conform to the following schema:
```json
{
    "reference": "<reference genome>",
    "accessions": {
        "<accession 1>": [
            "<run 1>",
            "<run 2>"
        ],
        "<accession 2>": [
            "<run 1>",
            "<run 2>"
        ]
    }
}
```
For instance, to run 2 accessions (each with a variable number of sequencer runs) against the "PAS014479_MAS1.0" reference genome:
```json
{
    "reference": "PAS014479_MAS1.0",
    "accessions": {
        "BGV012640": [
            "SRR7279605"
        ],
        "PI487625": [
            "SRR7279526",
            "SRR7279658"
        ]
    }
}
```
Note that the reference genome file (e.g., `PAS014479_MAS1.0.fasta`) must also exist in the working directory.
To execute a dry run (does not actually do anything, just prints the rules scheduled for execution):

```bash
snakemake --snakefile Snakefile.variants -np --configfile "<config file>"
```
You probably won't want to run the pipeline interactively unless your dataset is very small and you have a tmux or screen session open, as the pipeline will take several hours at minimum. If you do, run:

```bash
snakemake --snakefile Snakefile.variants --latency-wait 30 --restart-times 2 --configfile "<config file>" --jobs 500 --cluster "qsub -l walltime={params.walltime} -l nodes={params.nodes}:ppn={params.ppn} -l mem={params.mem} -M <your email address> -m ae"
```
Several command line arguments are passed to Snakemake:

- `--latency-wait`: seconds to wait after each rule completes before checking for expected output files (compensates for filesystem latency)
- `--restart-times`: times to restart each rule if it fails
- `--jobs`: maximum number of concurrent jobs submitted to the cluster scheduler
- `--cluster`: the command to use when submitting each job to the cluster scheduler; parameters (e.g., `params.walltime`) are specified in the Snakefile on a per-rule basis, as shown in the sketch below
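For example, a rule might declare its cluster resources like this (hypothetical values; see the actual Snakefile for the real ones), which the `--cluster` template interpolates into the `qsub` command:

```
rule align:
    # ...input, output, and shell directives omitted...
    params:
        walltime = "48:00:00",  # becomes {params.walltime} in the qsub command
        nodes = 1,              # becomes {params.nodes}
        ppn = 8,                # becomes {params.ppn}
        mem = "16gb"            # becomes {params.mem}
```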
These options can be reconfigured as needed. Note that if the pipeline has run previously in the same directory, you may need to invoke Snakemake once with the extra flag `--unlock` to release the directory lock before rerunning; alternatively, use `--nolock` to ignore directory locks.
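For instance, a minimal unlock invocation (assuming the same config file) might look like:

```bash
snakemake --snakefile Snakefile.variants --unlock --configfile "<config file>"
```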
To submit the pipeline to Sapelo2 as a batch job, simply embed the command above in a job submission script (several examples follow).
Depending on the size of your dataset you may need to tune cluster resources on a per-rule basis.
### Torque/Moab (PBS)
```bash
#PBS -S /bin/bash
#PBS -N Snakemato
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=150:00:00
#PBS -l mem=20gb
#PBS -M <your email address>
#PBS -m ae

cd $PBS_O_WORKDIR

module load Miniconda3/4.7.10
source activate <your conda environment>

ulimit -c unlimited

snakemake --snakefile Snakefile.variants --latency-wait 30 --restart-times 2 --configfile "config.json" --jobs 500 --cluster "qsub -l walltime={params.walltime} -l nodes={params.nodes}:ppn={params.ppn} -l mem={params.mem} -M <your email address> -m ae"
```
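Assuming the script above is saved as `variants.sh` (a hypothetical filename), submit it with:

```bash
qsub variants.sh
```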
### Slurm
TODO
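A Slurm script is not yet provided. As an untested starting point, an equivalent submission script might look like the sketch below; the partition name, module version, and `sbatch` resource flags are assumptions to check against current Sapelo2 documentation:

```bash
#!/bin/bash
#SBATCH --job-name=Snakemato
#SBATCH --partition=batch            # assumed partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=150:00:00
#SBATCH --mem=20gb
#SBATCH --mail-user=<your email address>
#SBATCH --mail-type=END,FAIL

cd $SLURM_SUBMIT_DIR

module load Miniconda3/4.7.10
source activate <your conda environment>

# The per-rule params are unchanged; only the scheduler command differs.
snakemake --snakefile Snakefile.variants --latency-wait 30 --restart-times 2 --configfile "config.json" --jobs 500 --cluster "sbatch --time={params.walltime} --nodes={params.nodes} --ntasks-per-node={params.ppn} --mem={params.mem} --mail-user=<your email address> --mail-type=END,FAIL"
```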
The `Snakefile.depths` pipeline calculates average read depths for a collection of `.bam` files.
This pipeline accepts a configuration file with schema:

```json
{
    "accessions": [
        "<accession 1>",
        "<accession 2>"
    ]
}
```
The pipeline expects `.bam` files in the working directory, named `<accession>.bam`.
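For example, reusing the accessions from the earlier variants config purely for illustration, the working directory might contain:

```
BGV012640.bam
PI487625.bam
depths_config.json
Snakefile.depths
```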
To execute a dry run (does not actually do anything, just prints the rules scheduled for execution):

```bash
snakemake --snakefile Snakefile.depths -np --configfile "<config file>"
```
To execute the pipeline interactively (this should be done from an interactive job):

```bash
snakemake --snakefile Snakefile.depths --latency-wait 30 --restart-times 2 --configfile "<config file>" --jobs 500 --cluster "qsub -l walltime={params.walltime} -l nodes={params.nodes}:ppn={params.ppn} -l mem={params.mem} -M <your email address> -m ae"
```
The command line arguments and directory-locking notes are the same as described for `Snakefile.variants` above. To submit the pipeline to Sapelo2 as a batch job:
```bash
#PBS -S /bin/bash
#PBS -N depths
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
#PBS -l mem=20gb
#PBS -M <your email address>
#PBS -m ae

cd $PBS_O_WORKDIR

module load Miniconda3/4.7.10
source activate <your conda environment>

ulimit -c unlimited

snakemake --snakefile Snakefile.depths --configfile depths_config.json --latency-wait 30 --restart-times 2 --jobs 500 --cluster "qsub -l walltime={params.walltime} -l nodes={params.nodes}:ppn={params.ppn} -l mem={params.mem} -M <your email address> -m ae"
```
The repository also includes various utilities carried over from a previous collaborator, Nathan.