Skip to content

Latest commit

 

History

History
201 lines (120 loc) · 11.4 KB

align.md

File metadata and controls

201 lines (120 loc) · 11.4 KB

Sequence alignment

Woltka takes sequence alignment files as input and generate feature tables. This means that you need to prepare alignments BEFORE running Woltka. The way you want your sequences to be aligned is highly flexible, and can be customized according to your research goals. This page provides general guidelines and detailed examples.

Contents

Sequencing data

This will be your sequencing data generated from microbiome samples. They can be DNA (metagenome), RNA (metatranscriptome), or other types of data. They are usually in the file format of FastQ or Fasta, but other formats may also apply.

  • Note: Woltka is designed for whole-(meta)genome shotgun sequencing data. Technically, you can analyze 16S rRNA data using Woltka, but I don't see the rationale. Please instead learn about QIIME 2.

Proper quality control should be performed prior to sequence alignment. Here is an introduction. Also see our latest benchmarks: Armstrong et al. (2022).

Reference database

The sequence data should be aligned to a reference database. It can be a database of reference genomes. Or, it can be a database of reference genes (nucleotides or proteins). The latter scenario is less typical for Woltka, but it works.

Our WoL database can be downloaded from this Dropbox site. wolr1_seqs is the genome sequences (recommended). wolr1_orfs is the re-annotated ORF sequences (amino acids). This database is generally useful for microbiome studies.

We have also tested multiple other databases, including GEBA-I, UHGG, MIDAS, GTDB (we have a dedicated tutorial for GTDB), and GEM. The results were reported in Zhu et al. (2022).

In summary, Woltka works with any database. The choice of database should depend on your research scope, and technical feasibility.

Alignment with Bowtie2

Woltka has been most extensively tested and validated on sequence alignments generated by Bowtie2 (Langmead and Salzberg, 2012), an ultrafast short sequence aligner that has been widely adopted in many applications.

Database indexing

Let's say your reference genome sequences are in a Fasta-formatted file db.fna. Before running Bowtie2 alignment, you need to index the reference genome database (8 is the number of CPU cores you plan to allocate, same below):

bowtie2-build --threads 8 db.fna db

This takes quite a while, and it could cost a substantial amount of memory, especially for large databases (many reference genomes). So do it on a supercomputer. The good news is that you just need to do it once.

For example, our WoL database, which has 10,575 reference genomes, took nearly 9 hours to build and had a peak memory consumption of 190 GB. The indexed database costs 54.5 GB disk space.

Below is the benchmark we generated for your reference (disclaimer: we make no warranty to its accuracy).

Bowtie2 indexing

Sequence alignment

Then you can align your sequencing data against the database. Let's assume you have paired-end sequences in FastQ files R1.fq and R2.fq for each sample, or for the entire multiplexed dataset. The basic alignment command is as simple as:

bowtie2 -p 8 -x db -1 R1.fq -2 R2.fq -S output.sam

The output file output.sam is an alignment file in SAM format. This will be the input file for Woltka.

As a rule of thumb, the --very-sensitive flag is recommended, because we want to maximize discovery of diverse microbes in the sample (instead for pursuing accurate alignment to the human genome).

In addition, we can suppress SAM header (--no-head) and unaligned sequences (--no-unal) to reduce the size of output files. Woltka won't need these. But in case you need them for other applications, you may skip these flags.

Moreover, we can crop out the aligned sequences and scores (again, don't do it if you need these). Bowtie2 doesn't have this function, but it can be achieved using Linux commands cut -f1-9 and sed 's/$/\t*\t*/'.

Finally, we can compress the SAM files using gzip to further save disk space.

So the command becomes (recommended):

bowtie2 -p 8 -x db -1 R1.fq -2 R2.fq --very-sensitive --no-head --no-unal | cut -f1-9 | sed 's/$/\t*\t*/' | gzip > output.sam.gz

The alignment step costs much less memory compared with the database indexing step. Below is our benchmark on the HMP metagenome dataset against our WoL database (disclaimer: we make no warranty to its accuracy).

  • From the benchmark we learned that Bowtie2 scales well with several CPU cores (8 is a good number). But it perhaps makes less sense to use more than 16 CPU cores for a single task.

Bowtie2 alignment

It maybe worth checking out the Bowtie2 manual and this performing tuning guide on how to optimize the Bowtie2 framework.

The SHOGUN protocol

SHOGUN (Hillmann et al., 2018, 2020) is a metagenomics package developed by our collaborators and us. It wraps up sequence aligners such as Bowtie2. The team did benchmarks and found that the following additional Bowtie2 parameters are suitable for the task of metagenomics. Our team also validated this in a separate work (Zhu et al., 2022).

  • In brief, this means that Bowtie2 will return up to 16 hits that have 95% or higher sequence identity with each query sequence. You may consider adding them to the command above.
-k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05"

So the entire command becomes:

bowtie2 -p 8 -x db -1 R1.fq -2 R2.fq --very-sensitive --no-head --no-unal -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" | cut -f1-9 | sed 's/$/\t*\t*/' | gzip > output.sam.gz

Additionally, the original SHOGUN program only works with unpaired sequencing reads in FASTA format. To fully mimic its behavior, one needs to do:

seqtk mergepe R1.fq R2.fq | seqtk seq -a > input.fa

Then:

bowtie2 -p 8 -x db -f -U input.fa --very-sensitive --no-head --no-unal -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" | cut -f1-9 | sed 's/$/\t*\t*/' | gzip > output.sam.gz

This is equivalent to the SHOGUN command:

shogun align -d db -a bowtie2 -t 8 -p 0.95 -i input.fa -o .

Other alignment tools

Woltka has no limitation on what alignment tool should be used. Beyond Bowtie2, it is totally okay to use whatever you have already been using, for consistency with other parts of your research project (which is important!). Below are a few tools which we have tested.

Minimap2

Minimap2 (Li, 2018) is a rapid aligner for short and long DNA and mRNA sequences. The default setting is for long sequences (such as PacBio and Nanopore). To make it work the best with short reads (such as Illumina), you need to add the -ax sr parameter.

Database indexing:

minimap2 -ax sr db.fna -d db.mmi

Sequence alignment:

minimap2 -t 8 -ax sr db R1.fq R2.fq -o output.sam

The output file can be shrinked to save space. Like with Bowtie2 outputs, you can strip off sequence, scores and tags from each alignment using cut -f1-9 followed by sed 's/$/\t*\t*/'. Additionally, because Minimap2 cannot suppress SAM header and unaligned sequences (when this document is written), you can achieve this with awk '!/^@/ && $3 != "*"'. (You can also use this trick to shrink existing Bowtie2 outputs.) Therefore the command becomes:

minimap2 -t 8 -ax sr db R1.fq R2.fq | awk '!/^@/ && $3 != "*"' | cut -f1-9 | sed 's/$/\t*\t*/' | gzip > output.sam.gz

Note: Minimap2 by default output PAF format instead of SAM. Woltka is capable of parsing PAF. Therefore the following command also works:

minimap2 -t 8 -x sr db R1.fq R2.fq | gzip > output.paf.gz

Note: For each query sequence, Minimap2 by default returns up to 6 alignments (one primary and five secondary which have at least 80% of the primary score). This behavior can be tweaked using the -p and -N parameters of Minimap2.

BWA

BWA (Li and Durbin, 2009) (Burrows-Wheeler Aligner) is a classical short sequence aligner. It has been used many. Researchers of modern time may want to check out Minimap2 or BWA-MEM2. Anyway, here is the usage:

Database indexing:

bwa index -p db db.fna

Sequence alignment:

bwa mem -t 8 db R1.fq R2.fq > output.sam

BURST

BURST (Al-Ghalith and Knights, 2020) is a fast and optimal sequence aligner that scales well with modern metagenomic datasets while enjoying high accuracy. It is exhaustive, as it guarantees to return the best hits from the entire database (unlike tools such as Bowtie2 which is heuristic). You may consider it as a BLASTn alternative. Not as fast as Bowtie2, though, but a good choice when alignment accuracy is the priority.

Database building (these are the parameters we tested):

burst -t 8 -r db.fna -o db.edx -a db.acx -d DNA 320 -i 0.95 -s 1500

Sequence alignment:

burst -t 8 -r db.edx -a db.acx -q input.fa -o output.b6o -sa -fr -i 0.95

The output file is in BLAST format.

BLASTn

The old-school BLAST algorithm (Altschul et al., 1990) is still viable as long as your database / sequence data are of limited size. Despite being slower than modern algorithms, its alignment accuracy is probably the best.

Database building:

makeblastdb -in db.fa -dbtype nucl -out db -title db

Sequence alignment (parameters are for reference only):

blastn -num_threads 8 -db db -query input.fa -out output.b6o -outfmt 6 -evalue 1e-10 -max_target_seqs 25