Ensemblex: an accuracy-weighted ensemble genetic demultiplexing framework for single-cell RNA sequencing



Introduction

Ensemblex is an accuracy-weighted ensemble framework for genetic demultiplexing of pooled single-cell RNA sequencing (scRNAseq) data. Ensemblex can be used to demultiplex pools with or without prior genotype information. When demultiplexing with prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools:

  1. Demuxalot (Rogozhnikov et al.)
  2. Demuxlet (Kang et al.)
  3. Souporcell (Heaton et al.)
  4. Vireo-GT (Huang et al.)

When demultiplexing without prior genotype information, Ensemblex leverages the sample assignments of four individual, constituent genetic demultiplexing tools:

  1. Demuxalot (Rogozhnikov et al.)
  2. Freemuxlet (Kang et al.)
  3. Souporcell (Heaton et al.)
  4. Vireo (Huang et al.)

Upon demultiplexing pools with each of the four constituent genetic demultiplexing tools, Ensemblex processes the output files in a three-step pipeline to identify the most probable sample label for each cell based on the predictions of the constituent tools:

Step 1: Probabilistic-weighted ensemble
Step 2: Graph-based doublet detection
Step 3: Ensemble-independent doublet detection

As output, Ensemblex returns its own cell-specific sample labels with the corresponding assignment probabilities and singlet confidence scores, as well as the sample labels and corresponding assignment probabilities from each of its constituent tools. The demultiplexed sample labels can then be used in downstream analyses.
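To give an intuition for what an accuracy-weighted ensemble does, the toy sketch below sums a per-tool weight for each proposed sample label and reports the label with the largest share. It is not the Ensemblex algorithm: the `weighted_vote` function, the input format, and the weights are all hypothetical and exist only for illustration.

```shell
# Toy weighted vote: NOT the Ensemblex algorithm, only an illustration.
# Input (stdin): one line per tool, "<tool> <label> <weight>" (weights are made up).
# Output: the label with the largest summed weight and its share of the total weight.
weighted_vote() {
  awk '
    { score[$2] += $3; total += $3 }
    END {
      best = ""; max = -1
      for (label in score)
        if (score[label] > max || (score[label] == max && label < best)) {
          max = score[label]; best = label
        }
      if (total > 0) printf "%s %.2f\n", best, max / total
    }'
}

# Three of four hypothetical tools agree on sampleA.
printf '%s\n' \
  "demuxalot sampleA 0.9" \
  "demuxlet sampleA 0.8" \
  "souporcell sampleB 0.7" \
  "vireo sampleA 0.6" | weighted_vote
# prints: sampleA 0.77
```

The real framework additionally performs the two doublet detection steps listed above, which this sketch does not attempt to model.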

To facilitate the application of Ensemblex, we provide a pipeline that demultiplexes pooled cells by each of the individual constituent genetic demultiplexing tools and processes the outputs with the Ensemblex algorithm.

The pipeline comprises four distinct steps:

  1. Selection of Ensemblex pipeline and establishing the working directory (Set up)
  2. Prepare input files for constituent genetic demultiplexing tools
  3. Genetic demultiplexing by constituent demultiplexing tools
  4. Application of the Ensemblex framework

Below we provide a quick-start guide for using Ensemblex. For comprehensive documentation, please see the Ensemblex site. In the Ensemblex documentation, we outline each step of the Ensemblex pipeline, illustrate how to run the pipeline, define best practices, and provide a tutorial with publicly available datasets.


Installation

Install the Ensemblex container and load Apptainer:

## Download the Ensemblex container
curl "https://zenodo.org/records/11639103/files/ensemblex.pip.zip?download=1" --output ensemblex.pip.zip

## Unzip the Ensemblex container
unzip ensemblex.pip.zip

## Load Apptainer
module load apptainer/1.2.4

To test if the Ensemblex container is installed properly, run the following code:

## Define the path to ensemblex.pip
ensemblex_HOME=/path/to/ensemblex.pip

## Print help message
bash $ensemblex_HOME/launch_ensemblex.sh -h

This should return the following help message:

------------------- 
Usage:  /home/fiorini9/scratch/ensemblex.pip/launch_ensemblex.sh [arguments]
        mandatory arguments:
                -d  (--dir)  = Working directory (where all the outputs will be printed) (give full path)
                --steps  =  Specify the steps to execute. Begin by selecting either init-GT or init-noGT to establish the working directory. 
                       For GT: vireo, demuxalot, demuxlet, souporcell, ensemblexing 
                       For noGT: vireo, demuxalot, freemuxlet, souporcell, ensemblexing 

        optional arguments:
                -h  (--help)  = See helps regarding the pipeline arguments 
                --vcf  = The path of vcf file 
                --bam  = The path of bam file 
                --sortout  = The path snd nsme of vcf generated using sort  
 ------------------- 
 For a comprehensive help, visit  https://neurobioinfo.github.io/ensemblex/site/ for documentation. 
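If the help message does not appear, a quick check that the launcher script is where `ensemblex_HOME` points can narrow down the problem. The sketch below assumes only that `launch_ensemblex.sh` sits at the top level of the unzipped container, as in the command above; `check_ensemblex_home` is a hypothetical helper, not part of Ensemblex.

```shell
# Hypothetical helper (not part of Ensemblex): confirm the launcher script is present.
# $1 = path that ensemblex_HOME is meant to point to.
check_ensemblex_home() {
  if [ -f "$1/launch_ensemblex.sh" ]; then
    echo "OK: found $1/launch_ensemblex.sh"
  else
    echo "ERROR: launch_ensemblex.sh not found in $1" >&2
    return 1
  fi
}

# Usage (uncomment once ensemblex_HOME is defined):
# check_ensemblex_home "$ensemblex_HOME" && bash "$ensemblex_HOME/launch_ensemblex.sh" -h
```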

Step 1: Set up

Demultiplexing pooled cells with prior genotype information

Initiate the pipeline:

## Create and navigate to the working directory
mkdir -p /path/to/working_directory
cd /path/to/working_directory

## Define the path to ensemblex.pip
ensemblex_HOME=/path/to/ensemblex.pip

## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory

## Initiate the pipeline for demultiplexing with prior genotype information
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-GT

Demultiplexing pooled cells without prior genotype information

Initiate the pipeline:

## Create and navigate to the working directory
mkdir -p /path/to/working_directory
cd /path/to/working_directory

## Define the path to ensemblex.pip
ensemblex_HOME=/path/to/ensemblex.pip

## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory

## Initiate the pipeline for demultiplexing without prior genotype information
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step init-noGT
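The help message asks for the working directory to be given as a full path. A small guard like the one below can catch a relative path before the pipeline is initiated; `is_full_path` is a hypothetical helper, and the example path is the same placeholder used above.

```shell
# Hypothetical helper: succeed only if the argument is an absolute (full) path.
is_full_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

ensemblex_PWD=/path/to/working_directory

if is_full_path "$ensemblex_PWD"; then
  echo "Working directory is a full path: $ensemblex_PWD"
else
  echo "Please give the full path to the working directory" >&2
fi
```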

Step 2: Preparation of input files

Demultiplexing pooled cells with prior genotype information

The following files are required:

| File | Description |
| --- | --- |
| gene_expression.bam | Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) |
| gene_expression.bam.bai | Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) |
| barcodes.tsv | Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) |
| pooled_samples.vcf | vcf file describing the genotypes of the pooled samples |
| genome_reference.fa | Genome reference fasta file (e.g., 10X Genomics genome.fa) |
| genome_reference.fa.fai | Genome reference fasta index file (e.g., 10X Genomics genome.fa.fai) |
| genotype_reference.vcf | Population reference vcf file (e.g., 1000 Genomes Project) |

Prepare the input files:

## Define all of the required files
BAM=/path/to/possorted_genome_bam.bam
BAM_INDEX=/path/to/possorted_genome_bam.bam.bai
BARCODES=/path/to/barcodes.tsv
SAMPLE_VCF=/path/to/pooled_samples.vcf
REFERENCE_VCF=/path/to/genotype_reference.vcf
REFERENCE_FASTA=/path/to/genome.fa
REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai

## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory

## Copy the files to the input_files directory in the working directory
cp $BAM  $ensemblex_PWD/input_files/pooled_bam.bam
cp $BAM_INDEX  $ensemblex_PWD/input_files/pooled_bam.bam.bai
cp $BARCODES  $ensemblex_PWD/input_files/pooled_barcodes.tsv
cp $SAMPLE_VCF  $ensemblex_PWD/input_files/pooled_samples.vcf
cp $REFERENCE_VCF  $ensemblex_PWD/input_files/reference.vcf
cp $REFERENCE_FASTA  $ensemblex_PWD/input_files/reference.fa
cp $REFERENCE_FASTA_INDEX  $ensemblex_PWD/input_files/reference.fa.fai

Demultiplexing pooled cells without prior genotype information

The following files are required:

| File | Description |
| --- | --- |
| gene_expression.bam | Gene expression bam file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam) |
| gene_expression.bam.bai | Gene expression bam index file of the pooled samples (e.g., 10X Genomics possorted_genome_bam.bam.bai) |
| barcodes.tsv | Barcodes tsv file of the pooled cells (e.g., 10X Genomics barcodes.tsv) |
| genome_reference.fa | Genome reference fasta file (e.g., 10X Genomics genome.fa) |
| genome_reference.fa.fai | Genome reference fasta index file (e.g., 10X Genomics genome.fa.fai) |
| genotype_reference.vcf | Population reference vcf file (e.g., 1000 Genomes Project) |

Prepare the input files:

## Define all of the required files
BAM=/path/to/possorted_genome_bam.bam
BAM_INDEX=/path/to/possorted_genome_bam.bam.bai
BARCODES=/path/to/barcodes.tsv
REFERENCE_VCF=/path/to/genotype_reference.vcf
REFERENCE_FASTA=/path/to/genome.fa
REFERENCE_FASTA_INDEX=/path/to/genome.fa.fai

## Define the path to the working directory
ensemblex_PWD=/path/to/working_directory

## Copy the files to the input_files directory in the working directory
cp $BAM  $ensemblex_PWD/input_files/pooled_bam.bam
cp $BAM_INDEX  $ensemblex_PWD/input_files/pooled_bam.bam.bai
cp $BARCODES  $ensemblex_PWD/input_files/pooled_barcodes.tsv
cp $REFERENCE_VCF  $ensemblex_PWD/input_files/reference.vcf
cp $REFERENCE_FASTA  $ensemblex_PWD/input_files/reference.fa
cp $REFERENCE_FASTA_INDEX  $ensemblex_PWD/input_files/reference.fa.fai
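Before moving on to Step 3, it can help to confirm that every copied file is present and non-empty. The sketch below uses the destination file names from the `cp` commands in this step; `missing_inputs` is a hypothetical helper, and the `GT`/`noGT` mode argument is our own convention for including or skipping `pooled_samples.vcf`.

```shell
# Hypothetical helper: print each expected input file that is missing or empty.
# $1 = input_files directory, $2 = mode ("GT" also expects pooled_samples.vcf).
missing_inputs() {
  input_dir="$1"
  mode="$2"
  files="pooled_bam.bam pooled_bam.bam.bai pooled_barcodes.tsv reference.vcf reference.fa reference.fa.fai"
  if [ "$mode" = "GT" ]; then
    files="$files pooled_samples.vcf"
  fi
  for f in $files; do
    # -s is true only when the file exists and has a size greater than zero
    [ -s "$input_dir/$f" ] || echo "$f"
  done
}

# Usage (uncomment once ensemblex_PWD is defined):
# missing_inputs "$ensemblex_PWD/input_files" GT
```

An empty result means all expected files were found; any file name printed should be copied again before running the constituent tools.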

Step 3: Genetic demultiplexing by constituent tools

Demultiplexing pooled cells with prior genotype information

Demultiplex the pooled cells with each of Ensemblex's constituent tools:

## Define the paths to Ensemblex and the working directory 
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory

## Demuxalot
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot

## Demuxlet
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxlet

## Souporcell
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell

## Vireo
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo

Demultiplexing pooled cells without prior genotype information

Demultiplex the pooled cells with each of Ensemblex's constituent tools:

## Define the paths to Ensemblex and the working directory 
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory

## Freemuxlet
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step freemuxlet

## Souporcell
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step souporcell

## Vireo
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step vireo

## Demuxalot
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step demuxalot
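The individual commands above can also be wrapped in a loop that stops at the first failing step. This is a minimal sketch under two assumptions: that each launcher call blocks until its step has finished, and that a failed step returns a non-zero exit code. `constituent_steps` and `run_constituents` are hypothetical helpers; the step names and their order are copied from the commands in this section.

```shell
# Print the constituent steps for a mode, in the order the commands appear above.
constituent_steps() {
  case "$1" in
    GT)   echo "demuxalot demuxlet souporcell vireo" ;;
    noGT) echo "freemuxlet souporcell vireo demuxalot" ;;
    *)    echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Run each constituent step in turn and stop at the first failure.
# Expects ensemblex_HOME and ensemblex_PWD to be defined as shown above.
run_constituents() {
  steps=$(constituent_steps "$1") || return 1
  for step in $steps; do
    echo "Running step: $step"
    bash "$ensemblex_HOME/launch_ensemblex.sh" -d "$ensemblex_PWD" --step "$step" || {
      echo "Step failed: $step" >&2
      return 1
    }
  done
}
```

With both variables defined, `run_constituents GT` or `run_constituents noGT` runs the four steps for the chosen pipeline.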

Step 4: Application of Ensemblex

Demultiplexing pooled cells with prior genotype information

## Define the paths to Ensemblex and the working directory 
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory

## Compute ensemble classification
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing

Demultiplexing pooled cells without prior genotype information

## Define the paths to Ensemblex and the working directory 
ensemblex_HOME=/path/to/ensemblex.pip
ensemblex_PWD=/path/to/working_directory

## Compute ensemble classification
bash $ensemblex_HOME/launch_ensemblex.sh -d $ensemblex_PWD --step ensemblexing

Contributing

Any contributions or suggestions for improving the Ensemblex pipeline are welcome and appreciated. If you encounter any issues, please open an issue in the GitHub repository. Alternatively, you are welcome to email the developers directly; for any questions, please contact Michael Fiorini: [email protected]

Changelog

Every release is documented on the GitHub Releases page.

License

This project is licensed under the MIT License.

Acknowledgement

The Ensemblex pipeline was produced for projects funded by the Canadian Institutes of Health Research and the Michael J. Fox Foundation Parkinson's Progression Markers Initiative (MJFF PPMI) in collaboration with The Neuro's Early Drug Discovery Unit (EDDU), McGill University. It was written by Michael Fiorini and Saeid Amiri with supervision from Rhalena Thomas and Sali Farhan at the Montreal Neurological Institute-Hospital. Copyright belongs to MNI BIOINFO CORE.
