Skip to content

Latest commit

 

History

History
831 lines (662 loc) · 39.9 KB

pipeline_documentation.md

File metadata and controls

831 lines (662 loc) · 39.9 KB

ZARP: workflow documentation

This document describes the individual steps of the workflow. For instructions on installation and usage please see here.

Table of Contents

Third-party software used

Tag lines were taken from the developers' websites (code repository or manual)

Name License Tag line More info
ALFA MIT "Annotation Landscape For Aligned reads" - "[...] provides a global overview of features distribution composing NGS dataset(s)" code / manual / publication
bedGraphToBigWig MIT "Convert a bedGraph file to bigWig format" code / manual
bedtools GPLv2 "[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF" code / manual
cutadapt MIT "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" code / manual / publication
gffread MIT "[...] validate, filter, convert and perform various other operations on GFF files" code / manual
Entrez Direct custom "[...] an advanced method for accessing the NCBI's set of interconnected databases from a UNIX terminal window" code / manual / publication
FastQC GPLv3 "A quality control analysis tool for high throughput sequencing data" code / manual
ImageMagick custom^ "[...] create, edit, compose, or convert bitmap images" code / manual
kallisto BSD-2 "[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads" code / manual / publication
MultiQC GPLv3 "Aggregate results from bioinformatics analyses across many samples into a single report" code / manual / publication
pigz custom "[...] parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data" code / manual
RSeqC GPLv3 "[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure." code / manual / publication
Salmon GPLv3 "Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment" code / manual / publication
SAMtools MIT "[...] suite of programs for interacting with high-throughput sequencing data" code / manual / publication
SRA Tools custom "[...] collection of tools and libraries for using data in the INSDC Sequence Read Archives" code / manual
STAR MIT "Spliced Transcripts Alignment to a Reference" - "RNA-seq aligner" code / manual / publication

^ compatible with GPLv3

Description of workflow steps

The workflow consists of three Snakemake files: A main Snakefile and an individual Snakemake file for each sequencing mode (single-end and paired-end), as parameters for some tools differ between sequencing modes. The main Snakefile contains general steps for the creation of indices and other required files derived from the annotations, steps that are applicable to both sequencing modes, and steps that deal with gathering, summarizing or combining results. Individual steps of the workflow are described briefly, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table) are also described. Descriptions for steps for which individual "rules" exist for single- and paired-end sequencing libraries are combined, and only differences between the modes are highlighted.

Rule graph

rule_graph

Visual representation of workflow. Automatically prepared with Snakemake.

Preparatory

Read sample table

Requirements
  • tab-separated values (.tsv) file
  • first row has to contain parameter names as in samples.tsv
  • first column used as sample identifiers
Parameter name Description Data type(s)
sample Descriptive sample name.
NOTE: samples split in multiple fastq files (multilane samples), can be automatically merged by using the same ID
str
seqmode There are two allowed values pe (paired-end) and se (single-end) according to the protocol used. str
fq1 Path of library file in .fastq.gz format (or mate 1 read file for paired-end libraries). str
fq2 Path of mate 2 read file in .fastq.gz format. Value ignored for for single-end libraries. str
fq1_3p Required for Cutadapt. 3' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. str
fq1_5p Required for Cutadapt. 5' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. str
fq2_3p Required for Cutadapt. 3' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. str
fq2_5p Required for Cutadapt. 5' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. str
fq1_polya_3p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. str
fq1_polya_5p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. str
fq2_polya_3p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. str
fq2_polya_5p Required for Cutadapt. Stretch of As or Ts, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. str
index_size Required for STAR. Ideally the maximum read length minus 1. (max(ReadLength)-1). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. int
kmer Required for Salmon. Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. int
organism Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below.
Example: GRCh38
str
gtf Required for STAR. Path to gene annotation .gtf file. File needs to be in subdirectory corresponding to organism field.
Example: /path/to/GRCh38/gene_annotations.gtf
str
genome Required for STAR. Path to genome .fa file. File needs to be in subdirectory corresponding to organism field.
Example: /path/to/GRCh38/genome.fa
str
sd Required for kallisto and Salmon, but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles int
mean Required for kallisto and Salmon, but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles int
libtype Required for Salmon, and, after internal conversion, for kallisto and ALFA. See Salmon manual for allowed values.
WARNING: do NOT use A to automatically infer the salmon library type, this will cause kallisto and ALFA to fail.
str

Create log directories

Sets up logging directories for the workflow run environment. Vanilla Python statement, not a Snakemake rule.

Sequencing mode-independent

start

Copy and rename read files.

Local rule

create_index_star

Create index for STAR short read aligner.

Indices need only be generated once for each combination of genome, set of annotations and index size.

  • Input
    • Genome sequence file (.fasta)
    • Gene annotation file (.gtf)
  • Parameters
    • samples.tsv
      • --sjdbOverhang: maximum read length - 1; lower values may reduce accuracy, higher values may increase STAR runtime; specify in sample table column index_size
  • Output

sort_gtf

Sort provided gtf by chromosome, start and end coordinates.

This process is executed once for the provided gtf annotation file.

extract_transcriptome

Create transcriptome from genome and gene annotations with gffread.

concatenate_transcriptome_and_genome

Concatenate reference transcriptome and genome.

Required by Salmon

create_index_salmon

Create index for Salmon quantification.

Required if Salmon is to be used in "mapping-based" mode. create_index_salmon Index is built using an auxiliary k-mer hash over k-mers of a specified length (default: 31). While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the selected k-mer size will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivty. Empirically, the default value of k = 31 seems to work well for reads of 75 bp or longer. For shorter reads, consider using a smaller k.

create_index_kallisto

Create index for kallisto quantification.

The default kmer size of 31 is used in this workflow and is not configurable by the user.

extract_transcripts_as_bed12

Convert transcripts from .gtf to extended 12-column .bed format with custom-script. Note that the default transcript type setting is used, which is "protein_coding".

fastqc

Prepare quality control report for reads library with FastQC.

  • Input
    • Reads file (.fastq.gz); from start
  • Output
    • FastQC output directory with report (.txt) and figures (.png); used in multiqc_report

fastqc_trimmed

Prepare quality control report for trimmed reads (after adapter and poly(A)-tail removal) with FastQC.

sort_genomic_alignment_samtools

Sort BAM file with SAMtools.

Sort a genome aligned BAM file.

index_genomic_alignment_samtools

Index BAM file with SAMtools.

Indexing a genome sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions. Moreover, indexing is required by genome viewers such as IGV so that the viewers can quickly display alignments in a genomic region of interest.

star_rpm

Create stranded bedGraph coverage (.bg) with STAR.

Uses STAR's RPM normalization functionality.

STAR RPM uses SAM flags to correctly tell where the read and its mate mapped to. That is, if mate 1 is mapped to the plus strand, then mate 2 is mapped to the minus strand, and STAR will count mate 1 and mate 2 to the plus strand. This is in contrast to bedtools genomecov -bg -split, where a mate is assigned to a strand irrespective of its corresponding mate.

rename_star_rpm_for_alfa

Rename and copy stranded bedGraph coverage tracks such that they comply with ALFA.

Local rule

Renaming to plus.bg and minus.bg depends on library orientation, which is provided by user in sample table column libtype.

sort_bed_4_big

Sort bedGraph files with bedtools.

prepare_bigWig

Generate bigWig from bedGraph with bedGraphToBigWig.

  • Input
  • Output
    • Coverage files, one per strand and sample (.bw); used in finish

calculate_TIN_scores

Calculates the Transcript Integrity Number (TIN) for each transcript with custom script based on RSeqC.

TIN is conceptually similar to RIN (RNA integrity number) but provides transcript-level measurement of RNA quality and is more sensitive for low-quality RNA samples.

  • TIN score of a transcript reflects RNA integrity of the transcript
  • Median TIN score across all transcripts reflects RNA integrity of library
  • TIN ranges from 0 (the worst) to 100 (the best)
  • A TIN of 60 means that 60% of the transcript would have been covered if the read coverage were uniform
  • A TIN of 0 will be assigned if the transcript has no coverage or coverage is below threshold

salmon_quantmerge_genes

Merge gene-level expression estimates for all samples with Salmon.

Rule is run once per sequencing mode

salmon_quantmerge_transcripts

Merge transcript-level expression estimates for all samples with Salmon.

Rule is run once per sequencing mode

kallisto_merge_genes

Merge gene-level expression estimates for all samples with custom script.

Rule is run once per sequencing mode

  • Input
  • Output
    • Gene TPM table (custom .tsv)
    • Gene read count table (custom .tsv)
    • Mapping gene/transcript IDs table (custom .tsv)
  • Non-configurable & non-default
    • -txOut FALSE: gene-level summarization (default would be transcript level)

kallisto_merge_transcripts

Merge transcript-level expression estimates for all samples with custom script.

Rule is run once per sequencing mode

  • Input
  • Output
    • Transcript TPM table (custom .tsv)
    • Transcript read count table (custom .tsv)

pca_kallisto

Run PCA analysis on kallisto genes and transcripts with custom script.

Rule is run one time for transcript estimates and one time for genes estimates

  • Input
    • Transcript/Genes TPM table (custom .tsv)
  • Output
    • Directory with PCA plots, scree plot and top loading scores.

pca_salmon

Run PCA analysis on salmon genes and transcripts with custom script.

Rule is run one time for transcript estimates and one time for genes estimates

  • Input
    • Transcript/Genes TPM table (custom .tsv)
  • Output
    • Directory with PCA plots, scree plot and top loading scores.

generate_alfa_index

Create index for ALFA.

  • Input
  • Output
    • ALFA index; stranded and unstranded; used in alfa_qc

alfa_qc

Annotate alignments with ALFA.

For details on output plots, see ALFA documentation.
Note: the read orientation of a sample will be inferred from salmon libtype specified in samples.tsv

prepare_multiqc_config

Prepare config file for MultiQC.

Local rule

  • Input
  • Parameters All parameters for this rule have to be specified in main config.yaml
    • --intro-text
    • --custom-logo
    • --url
  • Output

multiqc_report

Prepare interactive report from results and logs with MultiQC.

finish

Target rule as required by Snakemake.

Local rule

Sequencing mode-specific

Steps described here have two variants, one with the specified names for samples prepared with a single-end sequencing protocol, one with pe_ prepended to the specified name for samples prepared with a paired-end sequencing protocol.

remove_adapters_cutadapt

Remove adapter sequences from reads with Cutadapt.

  • Input

    • Reads file (.fastq.gz); from start
  • Parameters

    • samples.tsv
      • Adapters to be removed; specify in sample table columns fq1_3p, fq1_5p, fq2_3p, fq2_5p
    • rule_config.yaml:
      • -m 10: Discard processed reads that are shorter than 10 nt. If specified in rule_config.yaml, it will override ZARP's default value of m=1 for this parameter. Note that this is different from cutadapt's default behavior (m=0), which leads to empty reads being retained, causing problems in downstream applications in ZARP. We thus strongly recommend to not set the value of m to 0! Refer to cutadapt's documentation for more information on the m parameter.
      • -n 2: search for all the given adapter sequences repeatedly, either until no adapter match was found or until 2 rounds have been performed. (default 1)
  • Output

remove_polya_cutadapt

Remove poly(A) tails from reads with Cutadapt.

  • Input
  • Parameters
    • samples.tsv
      • Poly(A) stretches to be removed; specify in sample table columns fq1_polya and fq2_polya
    • rule_config.yaml
      • -m 10: Discard processed reads that are shorter than 10 nt. If specified in rule_config.yaml, it will override ZARP's default value of m=1 for this parameter. Note that this is different from cutadapt's default behavior (m=0), which leads to empty reads being retained, causing problems in downstream applications in ZARP. We thus strongly recommend to not set the value of m to 0! Refer to cutadapt's documentation for more information on the m parameter.
      • -O 1: minimal overlap of 1 (default: 3)
  • Output

map_genome_star

Align short reads to reference genome and/or transcriptome with STAR.

  • Input
  • Parameters
    • rule_config.yaml
      • --outFilterMultimapScoreRange=0: the score range below the maximum score for multimapping alignments (default 1)
      • --outFilterType=BySJout: reduces the number of ”spurious” junctions
      • --alignEndsType: one of Local (standard local alignment with soft-clipping allowed) or EndToEnd (force end-to-end read alignment, do not soft-clip); specify in sample table column soft_clip
      • --twopassMode: one of None (1-pass mapping) or Basic (basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly); specify in sample table column pass_mode
      • --outFilterMultimapNmax: maximum number of multiple alignments allowed; if exceeded, read is considered unmapped; specify in sample table column multimappers
  • Output
  • Non-configurable & non-default
    • --outSAMattributes=All: NH HI AS nM NM MD jM jI MC ch
    • --outStd=BAM_Unsorted: which output will be directed to STDOUT (default 'Log')
    • --outSAMtype=BAM Unsorted: type of SAM/BAM output (default SAM)
    • --outSAMattrRGline: ID:rnaseq_pipeline SM: sampleID

quantification_salmon

Estimate transcript- and gene-level expression with Salmon.

  • Input
  • Parameters
    • samples.tsv
      • libType: see Salmon manual for allowed values; specify in sample table column libtype
      • --fldMean: mean of distribution of fragment lengths; specify in sample table column mean (single-end only)
      • --fldSD: standard deviation of distribution of fragment lengths; specify in sample table column sd (single-end only)
    • rule_config.yaml
      • --seqBias: correct for sequence specific biases
      • --validateMappings: enables selective alignment of the sequencing reads when mapping them to the transcriptome; this can improve both the sensitivity and specificity of mapping and, as a result, can improve quantification accuracy.
      • --writeUnmappedNames: write out the names of reads (or mates in paired-end reads) that do not map to the transcriptome. For paired-end this gives flags that indicate how a read failed to map (paired-end only)
  • Output

genome_quantification_kallisto

Generate pseudoalignments of reads to transcripts with kallisto.

Note: the kallisto strandedness parameter will be inferred from salmon libtype specified in samples.tsv

  • Input
  • Parameters
    • samples.tsv
      • -l: mean of distribution of fragment lengths; specify in sample table column mean (single-end only)
      • -s: standard deviation of distribution of fragment lengths; specify in sample table column sd (single-end only)
  • Output
  • Non-configurable & non-default
    • --single: Quantify single-end reads (single-end only)
    • --pseudobam: Save pseudoalignments to transcriptome to BAM file

Description of SRA download workflow steps

This separate workflow consists of three Snakemake files: A main sra_download.smk and an individual Snakemake file for each sequencing mode (single-end and paired-end), as parameters for some tools differ between sequencing modes. The main sra_download.smk contains general steps for downloading the samples from the SRA repository and determining the sequencing mode in order to execute the appropriate subsequent rules.Individual steps of the workflow are described briefly, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table) are also described. Descriptions for steps for which individual "rules" exist for single- and paired-end sequencing libraries are combined, and only differences between the modes are highlighted.

SRA Sequencing mode-independent

get_layout

Get the library type of each sample (paired or single-end) using efetch Entrez direct.

  • Output
    • A file with a name either PAIRED or SINGLE which is used downstream to run the appropriate subworkflow.

prefetch

Download the SRA entry using SRA Tools

add_fq_file_path

Aggregate the fastq file(s) path(s) in a table for all samples.

SRA Sequencing mode-specific

fasterq_dump

Converts SRA entry to fastq file(s) using SRA Tools

compress_fastq

Compresses fastq file(s) to .gz format. pigz

process_fastq

Keep the fastq.gz file path in a table, later aggregated in one table in add_fq_file_path.