Skip to content

How to run doCNA.

Karol Szlachta edited this page Aug 25, 2023 · 9 revisions

doCNA - mathematical model based decomposition of Whole Genome Sequencing data into regions of different Copy Number status

Brief description

After initial filtration to include only SNP from SuperGood list into analysis, the script extracts parameters of diploid reference. Eachof the chromosomes, is next divided into segments of uniform CNV status based on baf and coverage. In the last step, karyotypes are assigned to each of the segments in regard to the reference. Results of automatic analysis should be reviewed. Majority of errors can be corrected by adjusting reference coverage (-m0 option).

For a detail description of the method, please refer to the [wiki]. It is strongly encouraged to learn about details of the method before using it.

Running the analysis

usage: docna analyze [-h] [-s SAMPLE_NAME] [-n NO_PROCESSES] -i INPUT_FILE [-c CONFIG] 
                     [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL,NOTSET}] [-r] [-m0 COVERAGE_DIPLOID] [-v] 
                     [-m {AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} [{(ABn,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} ...]]
runs the analysis

optional arguments:
  -h, --help            show this help message and exit
  -s SAMPLE_NAME, --sample_name SAMPLE_NAME
                        Input sample name. Default: from file name.
  -n NO_PROCESSES, --no_processes NO_PROCESSES
                        Number of processes. Default: 1
  -i INPUT_FILE, --input_file INPUT_FILE
                        Name of the input file with alleles' counts
  -c CONFIG, --config CONFIG
                        INI file with parameters
  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL,NOTSET}, --level {DEBUG,INFO,WARNING,ERROR,CRITICAL,NOTSET}
                        Level of verbosity for std.err logger.
  -r, --report_solutions
                        Generate report with all solutions.
  -m0 COVERAGE_DIPLOID, --coverage_diploid COVERAGE_DIPLOID
                        Coverage of diploid.
  -m {(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} [{(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} ...], --models {(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} [{(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} ...]
                        Specify which of models should be included.
  -v, --version         Print version

Note that -m option depends on the contents of Models.py *Note that script is configured by default to reference chromosome names, with the exception of missing "chr" that will be automatically added. However, it can be easily adjusted to handle virtually any names of chromosomes, by simply editing CHROMS_ORDER list in Consts.py. "

Input files format

The input file

The input file, -i, should contain tab separated allele counts. It is expected to contain at least five columns and a header:

  • chromosome name
  • genomic position
  • count of reads with reference
  • count of reads with non-reference
  • Type of non-referencness. Only SNP are used.

The exact names of the columns can be specified in the config file, in the [InputColumns] section.

Allele counts can be relatively easily obtained by other software, created for that purpose.

Besides the above, script expects two more files with genomic data: cyto-bands and list of SuperGood markers. If no filtering (-f False) of input file if required, SuperGood markers are not used and can be omitted. Files names must be specified in [Input] section of the config file.

Cyto-band file

File is expected to contain five, tab delimited, columns:

  • '#chrom': name of the chromosome
  • chromStart: start position
  • chromEnd: end position
  • name: name of the cyto band; for example p36.33
  • gieStain: intensity of the band, 'acen' is for centromere.

SuperGood markers file

File is expected to contain two, tab delimited, columns:

  • chromosome name
  • genomic position

The names of those files can be altered by editing configuration file, the [Input] section.

Output files

Script can output 5 files: found CNA segments, processed data, detailed log, runs and all considered solutions (refer to wiki for detailed explanation).

CNA segments {sample_name}.bed

Found CNA segments are reported in a format of BED-like file. First three describe genomic range (chromosome, start, end) followed by estimated parameters: allelic imbalance, number of windows within the segments, coverage, copy number, distance to diploid reference, score of diploidy, model, distance to model, model score, clonality, segment symbol, cytobands, centromere fraction.

Processed data {sample_name}.dat

File contains input count data, filtered through SG markers list, unless specified otherwise, with 3 additional columns: coverage, vaf and assigned symbol E, N, or U. Useful for papardelle plot.

Detailed log {sample_name}.log

Contents of this file is self-explanatory.

Solutions {sample_name}.solutions

File reports all found solutions for each run. Intended as an aid in more in-depth analysis of complex cases and debugging.

Parameters {sample_name}.par

Genome-wide parameters are dumped into this file. The meaning of the parameters is as follows:

$m_0$, l - diploid reference parameters as estimated by lambda distribution for E part of the chromosomes, used to calculate copy number. The l stands for $\lambda$

$v_0$ - vaf/baf of diploid reference, based on BAF test

fb - widening parameter

$m$, $s$ - parameters of the gaussian, index (marked with underscore) indicates the parameter modeled

indexes $ai$ and $cn$ indicates allelic imbalance and copy number, respectively. Those are used in diploidy distance test to z score the distance 

index $d$ indicates the distance to diploid reference distribution

$a_d$ is a parameter in exponential distribution describirng distance to the model

models - list of models used to kariotype the sample

thr_model - estimated threshold for model score

thr_HE - estimated threshold for score of diploidity /score_HE/