
SAMTOOLS Output Formats

Hannes Hauswedell edited this page Jun 22, 2023 · 8 revisions

Since SAM and BAM were not originally designed for local alignments, especially of protein sequences, this document describes how Lambda implements the standard.

Please see the official specification if some of the terms used here are not clear to you.

| column | use in Lambda |
| --- | --- |
| QNAME | name of the query sequence, truncated at first whitespace |
| FLAG | bit 16 and bit 256 implemented in a standard-conforming way |
| RNAME | name of the subject sequence, truncated at first whitespace |
| POS | begin position of alignment on subject sequence; begin position on original untranslated DNA sequence for TBlastN, TBlastX, end position if negative strand; begin position on protein sequence for BlastP, BlastX |
| MAPQ | 255 |
| CIGAR | query DNA cigar (untranslated DNA sequence for BlastX, TBlastX); * for BlastP, TBlastN; reversed if negative strand/frame |
| RNEXT | * |
| PNEXT | 0 |
| TLEN | 0 |
| SEQ | query DNA sequence (untranslated DNA sequence for BlastX, TBlastX); * for BlastP, TBlastN; reverse-complemented if negative strand/frame; see below for clipping |
| QUAL | * |
| OPT | see below |
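
As a rough illustration of the column layout above, the following Python sketch splits one tab-separated SAM record into its eleven mandatory fields and checks the two FLAG bits that Lambda sets (16 for reverse strand, 256 for secondary alignment, as defined in the SAM specification). The names `SamRecord`, `parse_sam_line` and `EXAMPLE_LINE` are hypothetical, and the example record is made up for this sketch rather than taken from real Lambda output.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM columns plus the raw optional fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: tuple


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line) into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record needs at least 11 tab-separated fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


def is_reverse(record: SamRecord) -> bool:
    """FLAG bit 16: the query is reported on the reverse strand."""
    return bool(record.flag & 16)


def is_secondary(record: SamRecord) -> bool:
    """FLAG bit 256: this is a secondary alignment of the query."""
    return bool(record.flag & 256)


# Made-up record for illustration only (272 = 256 + 16).
EXAMPLE_LINE = "\t".join(
    ["read1", "272", "subj1", "100", "255", "*", "*", "0", "0", "*", "*", "AS:i:50"]
)
```

Calling `parse_sam_line(EXAMPLE_LINE)` yields a record for which both `is_reverse` and `is_secondary` return `True`.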

Sequence strings

Following the recommendations of the specification, the SEQ field is only written if it differs from the previous line's SEQ field. This can be changed via Lambda's command line parameter --sam-bam-seq, which can be set to always or never (the latter saves the most space). This behaviour also applies to the qs tag defined below.
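
Downstream tools sometimes need the sequence on every line. A minimal Python sketch of how omitted SEQ fields could be filled back in is shown here. It assumes that an omitted SEQ is written as `*` (the SAM convention for an absent sequence) and that it always equals the most recently written SEQ, as the paragraph above describes. The function name `restore_seq` is hypothetical and not part of Lambda.

```python
from typing import Iterable, Iterator

SEQ_COLUMN = 9  # zero-based index of the SEQ field in a SAM record


def restore_seq(lines: Iterable[str]) -> Iterator[str]:
    """Yield SAM lines with '*' in SEQ replaced by the last written SEQ.

    Header lines (starting with '@') are passed through unchanged.
    Trailing newlines are stripped from all alignment lines.
    Assumption: a '*' means "same as the previous line's SEQ".
    """
    last_seq = "*"  # nothing seen yet; stays '*' if SEQ is never written
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[SEQ_COLUMN] == "*":
            fields[SEQ_COLUMN] = last_seq
        else:
            last_seq = fields[SEQ_COLUMN]
        yield "\t".join(fields)
```

If the file was produced with SEQ set to never, or in a mode where SEQ is always `*` (BlastP, TBlastN), the sketch leaves every line as it is.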

Clipping

Via the --sam-bam-clip parameter you can choose between hard-clipping and soft-clipping. Soft-clipping results in full sequences in the SEQ and qs fields, while hard-clipping only shows the locally matching part. Depending on this choice, the CIGAR strings will contain H or S characters. Hard-clipping is the default because it takes up less space.

Please be aware that if the query sequence is translated, those DNA positions that are lost because of frame shifts or incomplete frames (at the end of a sequence) are always hard-clipped. These positions are also not represented in the protein CIGAR (see the qs tag below).
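
To see how much of a query was clipped, the H and S operations of a CIGAR string can be summed. The Python sketch below does this with a regular expression. It relies only on the general SAM rule that the lengths of M, I, S, = and X operations add up to the length of SEQ. The function names and the example CIGAR strings are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")


def parse_cigar(cigar: str) -> list:
    """Return a list of (length, operation) pairs; '*' gives an empty list."""
    if cigar == "*":
        return []
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]


def clipped_bases(cigar: str) -> dict:
    """Count hard-clipped (H) and soft-clipped (S) positions."""
    ops = parse_cigar(cigar)
    return {
        "hard": sum(length for length, op in ops if op == "H"),
        "soft": sum(length for length, op in ops if op == "S"),
    }


def expected_seq_length(cigar: str) -> int:
    """Length SEQ should have: soft-clipped bases count, hard-clipped do not."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")
```

For the same alignment, a soft-clipped CIGAR such as `5S20M3S` implies a SEQ of 28 characters, whereas the hard-clipped form `5H20M3H` implies only the 20 aligned characters.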

Optional tags

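| group | tag | description |
| --- | --- | --- |
| official | AS | bit score |
| official | OC | query protein cigar (* for BLASTN) |
| official | NM | edit distance (in protein space unless BLASTN) |
| official | IH | number of matches this query has |
| regarding the alignment | ae | expect value |
| regarding the alignment | ar | raw score |
| regarding the alignment | ai | % identity (in protein space unless BLASTN) |
| regarding the alignment | ap | % positive (in protein space unless BLASTN) |
| regarding the query sequence | qf | query frame |
| regarding the query sequence | qs | query protein sequence (* for BLASTN) |
| regarding the subject sequence | sf | subject frame |
| regarding the subject sequence | st | subject taxonomy ID(s) separated by ; (see Taxonomic Workflows) |
| regarding all matches of this query | ls | lowest common ancestor scientific name (see Taxonomic Workflows) |
| regarding all matches of this query | lt | lowest common ancestor taxonomy ID (see Taxonomic Workflows) |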

These tags can be specified with the command line argument --sam-bam-tags. If you would like to see any other tags supported, please don't hesitate to contact us.
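
Optional fields in SAM follow the general `TAG:TYPE:VALUE` layout, so the tags listed above can be read into a dictionary with a few lines of Python. The sketch below is only an illustration: the type codes used in the example (`i`, `f`, `Z`) are assumptions about how Lambda writes each tag, and the function names and values are hypothetical.

```python
def parse_optional_fields(fields) -> dict:
    """Convert 'TAG:TYPE:VALUE' strings into a {tag: value} dictionary.

    Integers (type 'i') and floats (type 'f') are converted;
    every other type is kept as a string.
    """
    tags = {}
    for field in fields:
        # Split at most twice so that ':' inside the value is preserved.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value
    return tags


def subject_taxonomy_ids(tags: dict) -> list:
    """Split the ';'-separated st tag into a list of integer IDs."""
    raw = tags.get("st", "")
    return [int(part) for part in str(raw).split(";") if part]


# Made-up optional fields; the type codes are assumptions for this sketch.
EXAMPLE_FIELDS = ["AS:i:120", "ae:f:1e-30", "qf:i:-2", "st:Z:9606;10090"]
```

With these made-up fields, `parse_optional_fields(EXAMPLE_FIELDS)["qf"]` is the integer `-2`, and `subject_taxonomy_ids` returns the two taxonomy IDs as a list.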

Header

BAM files require all subject names to be written to the header. SAM does not require this, so Lambda omits them by default to save space (for protein databases in particular, this is a lot of data). If you still want them in SAM output, e.g. for better BAM compatibility, use the --sam-with-refheader option.
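
Whether a given SAM file carries the subject names can be checked by looking for `@SQ` header lines, which is where the SAM specification stores reference names (in the `SN` field). The following Python sketch collects them. The function name and the example lines are hypothetical.

```python
def reference_names_from_header(lines) -> list:
    """Collect the SN values of all @SQ header lines.

    Reading stops at the first non-header line, since SAM headers
    must come before the alignment records.
    """
    names = []
    for line in lines:
        if not line.startswith("@"):
            break
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names
```

An empty result suggests the file was written without the reference header, which is what the paragraph above describes as the default for SAM.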