####################################################################
Source code and datasets for the Para-DPMM model for single-cell transcriptomic clustering, provided to reproduce the results in the paper "Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures". Authors: Tiehang Duan, José P. Pinto, Xiaohui Xie.
The datasets folder contains the .mat files used in the experiments of the paper. You can also prepare your own data with the following procedure:
(1) Download the raw datasets from the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets) and store the files in the datasets folder;
(2) Follow the comments in "data_preparation.m" to modify the file names to match the downloaded data files;
(3) Follow the comments in "data_preparation.m" to set the number of cells (randomly selected) and the number of top variable genes;
(4) Run "data_preparation.m".
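As a rough illustration of what the cell and gene selection configured above involves, the following Python sketch draws a random subset of cells and keeps the genes with the highest variance across those cells. It is not the code in "data_preparation.m", whose exact selection rules are described in its own comments. The function name select_cells_and_genes, the cells-by-genes list layout and the use of plain population variance as the ranking criterion are all assumptions made for this example.

```python
import random
from statistics import pvariance


def select_cells_and_genes(counts, n_cells, n_genes, seed=0):
    """Pick n_cells random rows, then the n_genes highest-variance columns.

    counts is a list of cells, each a list of per-gene expression values
    (hypothetical layout, chosen for this sketch only).
    Returns (cell_indices, gene_indices, submatrix).
    """
    rng = random.Random(seed)
    # Randomly select cells without replacement; keep them in original order.
    cell_idx = sorted(rng.sample(range(len(counts)), n_cells))
    sampled = [counts[i] for i in cell_idx]

    n_total_genes = len(counts[0])
    # Variance of each gene across the sampled cells.
    variances = [
        pvariance([row[g] for row in sampled]) for g in range(n_total_genes)
    ]
    # Rank genes by variance, highest first; ties keep their original order.
    ranked = sorted(range(n_total_genes), key=lambda g: -variances[g])
    gene_idx = sorted(ranked[:n_genes])

    sub = [[row[g] for g in gene_idx] for row in sampled]
    return cell_idx, gene_idx, sub


if __name__ == "__main__":
    toy = [
        [0, 5, 1, 9],
        [0, 5, 2, 1],
        [0, 5, 1, 8],
        [0, 6, 2, 0],
    ]
    cells, genes, sub = select_cells_and_genes(toy, n_cells=3, n_genes=2)
    print("cells:", cells)
    print("genes:", genes)
    print("submatrix:", sub)
```

Fixing the seed makes the random cell selection repeatable, which helps when comparing clustering runs on the same subset.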
Enter the "Para_DPMM_Source_Code" directory.
We tested the code with MATLAB R2015a and gcc/4.8.4 (gcc/4.6.X and gcc/4.7.X should also work). The following two libraries must be installed:
- GNU Scientific Library: http://www.gnu.org/software/gsl/
- Eigen library: http://eigen.tuxfamily.org/
Steps:
(1) Install the packages mentioned above (for the Eigen library, you only need to place the unzipped files inside the "eigen" directory);
(2) Start MATLAB and change to the "main" subdirectory;
(3) In MATLAB, run compile_MEX.m;
(4) Run Para_DPMM.m and answer the prompts given by the program (an example):
1> Please enter dataset path: data_matrix_1_S_Set.mat
2> Please enter number of processors: 16
3> Please set the value of alpha: 1
4> Please enter computing time limit (seconds): 20
(5) Result: The training iteration and computation time log is saved in Para_DPMM_output.txt. The clustering result is saved in Para_DPMM_result.mat, where z is the cluster assignment produced by the Para_DPMM model, label is the ground truth cluster label, AR is the Adjusted Rand Index, RI is the Rand Index, MI is Mirkin's index and HI is Hubert's index.
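To make the four scores concrete, the following Python sketch computes them from two labelings of the same cells by counting pairs of cells. It is an illustration only, not the code that writes Para_DPMM_result.mat. The function name rand_indices is hypothetical, and the definitions of MI (fraction of disagreeing pairs) and HI (agreement minus disagreement) are an assumption that follows a common convention; the values stored in the result file may be computed differently.

```python
from collections import Counter
from math import comb


def rand_indices(z, label):
    """Return (AR, RI, MI, HI) for two labelings of the same cells."""
    if len(z) != len(label):
        raise ValueError("z and label must have the same length")
    n = len(z)
    if n < 2:
        raise ValueError("at least two cells are needed to form a pair")
    total_pairs = comb(n, 2)

    # Pairs placed in the same cluster by both labelings, by z, and by label.
    together_both = sum(comb(c, 2) for c in Counter(zip(z, label)).values())
    together_z = sum(comb(c, 2) for c in Counter(z).values())
    together_label = sum(comb(c, 2) for c in Counter(label).values())

    # Agreements: pairs together in both labelings or separated in both.
    agreements = total_pairs + 2 * together_both - together_z - together_label
    ri = agreements / total_pairs
    mi = 1.0 - ri   # assumed convention: fraction of disagreeing pairs
    hi = ri - mi    # assumed convention: agreement minus disagreement

    # Adjusted Rand Index: correct the pair count for chance agreement.
    expected = together_z * together_label / total_pairs
    max_index = 0.5 * (together_z + together_label)
    denom = max_index - expected
    # denom is zero only when both labelings are the same trivial partition;
    # returning 1.0 there is a convention chosen for this sketch.
    ar = 1.0 if denom == 0 else (together_both - expected) / denom
    return ar, ri, mi, hi


if __name__ == "__main__":
    z = [0, 0, 1, 1, 2, 2]
    label = [1, 1, 0, 0, 2, 0]
    print(rand_indices(z, label))
```

Cluster identifiers do not need to match between z and label, since only the grouping of cell pairs is compared.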
The compiled package was tested on HPC clusters with the modules gcc/6.1.0, gsl/2.3 and MATLAB/r2017b loaded.
Steps:
(1) module load gcc/6.1.0;
(2) module load gsl/2.3;
(3) module load MATLAB/r2017b;
(4) Run ./Para_DPMM and answer the prompts given by the program (an example):
1> Please enter dataset path: ./datasets/data_matrix_1_S_Set.mat
2> Please enter number of processors: 16
3> Please set the value of alpha: 1
4> Please enter computing time limit (seconds): 40
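The alpha prompt is not defined in this README. Assuming it sets the concentration parameter of the Dirichlet process (a reading based on the model name), a larger alpha makes it more likely, under the prior, that a cell opens a new cluster. The Python sketch below simulates only the Chinese restaurant process prior to show that effect. It is not the Para-DPMM sampler and ignores the data entirely; the function name crp_cluster_sizes is hypothetical.

```python
import random


def crp_cluster_sizes(n_points, alpha, seed=0):
    """Simulate cluster sizes from a Chinese restaurant process prior."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    sizes = []
    for i in range(n_points):
        # Point i joins existing cluster k with probability
        # sizes[k] / (i + alpha) and opens a new cluster with
        # probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        acc = 0.0
        for k, size in enumerate(sizes):
            acc += size
            if u < acc:
                sizes[k] += 1
                break
        else:
            # No existing cluster was chosen, so open a new one.
            sizes.append(1)
    return sizes


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        sizes = crp_cluster_sizes(1000, alpha)
        print("alpha =", alpha, "clusters under the prior:", len(sizes))
```

In the actual model the data likelihood also shapes the number of clusters, so this only indicates the direction in which alpha pushes the result.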
The training iteration and computation time log is saved in Para_DPMM_output.txt. The clustering result is saved in Para_DPMM_result.mat, where z is the cluster assignment produced by the Para_DPMM model, label is the ground truth cluster label, AR is the Adjusted Rand Index, RI is the Rand Index, MI is Mirkin's index and HI is Hubert's index.
In the paper, we compared Para-DPMM with several widely used single-cell clustering methods. Most of these methods are available as R packages, and in the "interface to comparison methods" folders we provide interface programs (written in R) for running those packages on these datasets. Please install the related R packages before using the interface programs.
Please feel free to use it for academic purposes.
Note: The Para-DPMM project depends heavily on the open source Dirichlet Process Mixtures package (http://people.csail.mit.edu/jchang7/code.php) written by Jason Chang.
###############################################################################