Calculating proportion of reads mapping to mitochondrial transcripts

The determination of mitochondrial ratios is a bit more complex and requires us to use genome annotations to determine which genes originated from the mitochondrial DNA.

Using annotation file to generate mitochondrial count metrics

We will be using AnnotationHub, which allows accession to a wide variety of online databases and other resources, to query Ensembl annotations made available through ensembldb. Ensembldb is a package that retrieves annotation for the databases directly from Ensembl.

Downloading database for organism of interest

To access the various annotations available from Ensembl for human, we need to first connect to AnnotationHub, then specify the organism and database we are interested in.

# Connect to AnnotationHub
ah <- AnnotationHub()

# Access the Ensembl database for organism
ahDb <- query(ah, 
              pattern = c("Homo sapiens", "EnsDb"), 
              ignore.case = TRUE)

Next, we acquire the latest annotation files from this Ensembl database.

We can first check which annotation versions are available:

# Check versions of databases available
ahDb %>% 
  mcols()

Since we want the most recent, we will return the AnnotationHub ID for this database:

# Acquire the latest annotation files
id <- ahDb %>%
  mcols() %>%
  rownames() %>%
  tail(n = 1)

Finally, we can use the AnnotationHub connection to download the appropriate Ensembl database, which should be version GRCh38.92.

# Download the appropriate Ensembldb database
edb <- ah[[id]]

And to extract gene-level information we can use the Ensembldb function genes() to return a data frame of annotations.

# Extract gene-level information from database
annotations <- genes(edb, 
                     return.type = "data.frame")

Exracting IDs for mitochondrial genes

We aren't interested in all of the information present in this annotations file, so we are going to extract that which is useful to us.

# Select annotations of interest
annotations <- annotations %>%
  dplyr::select(gene_id, gene_name, gene_biotype, seq_name, description, entrezid)
                         
View(annotations)

Now we can retrieve the genes associated with the different biotypes of interest:

# Extract IDs for mitochondrial genes
mt <- annotations %>%
        dplyr::filter(seq_name == "MT") %>%
        dplyr::pull(gene_name)

Adding metrics to metadata

Now that we have information about which genes are mitochondrial, we can quanitify whether we have contamination.

# Number of UMIs assigned to mitochondrial genes
metadata$mtUMI <- Matrix::colSums(counts[which(rownames(counts) %in% mt),], na.rm = T)

# Calculate of mitoRatio per cell
metadata$mitoRatio <- metadata$mtUMI/metadata$nUMI

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mitoRatio.md

mitoRatio.md

Calculating proportion of reads mapping to mitochondrial transcripts

Using annotation file to generate mitochondrial count metrics

Downloading database for organism of interest

Exracting IDs for mitochondrial genes

Adding metrics to metadata

Files

mitoRatio.md

Latest commit

History

mitoRatio.md

File metadata and controls

Calculating proportion of reads mapping to mitochondrial transcripts

Using annotation file to generate mitochondrial count metrics

Downloading database for organism of interest

Exracting IDs for mitochondrial genes

Adding metrics to metadata