-
Notifications
You must be signed in to change notification settings - Fork 1
How to run doCNA.
doCNA - mathematical model based decomposition of Whole Genome Sequencing data into regions of different Copy Number status
After initial filtration to include only SNP from SuperGood list into analysis, the script extracts parameters of diploid reference. Eachof the chromosomes, is next divided into segments of uniform CNV status based on baf and coverage. In the last step, karyotypes are assigned to each of the segments in regard to the reference. Results of automatic analysis should be reviewed. Majority of errors can be corrected by adjusting reference coverage (-m0 option).
For a detail description of the method, please refer to the [wiki]. It is strongly encouraged to learn about details of the method before using it.
usage: docna analyze [-h] [-s SAMPLE_NAME] [-n NO_PROCESSES] -i INPUT_FILE [-c CONFIG]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL,NOTSET}] [-r] [-m0 COVERAGE_DIPLOID] [-v]
[-m {AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} [{(ABn,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} ...]]
runs the analysis
optional arguments:
-h, --help show this help message and exit
-s SAMPLE_NAME, --sample_name SAMPLE_NAME
Input sample name. Default: from file name.
-n NO_PROCESSES, --no_processes NO_PROCESSES
Number of processes. Default: 1
-i INPUT_FILE, --input_file INPUT_FILE
Name of the input file with alleles' counts
-c CONFIG, --config CONFIG
INI file with parameters
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL,NOTSET}, --level {DEBUG,INFO,WARNING,ERROR,CRITICAL,NOTSET}
Level of verbosity for std.err logger.
-r, --report_solutions
Generate report with all solutions.
-m0 COVERAGE_DIPLOID, --coverage_diploid COVERAGE_DIPLOID
Coverage of diploid.
-m {(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} [{(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} ...], --models {(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} [{(AB)n,A,AA,AAB,AAAB,AAA,AAAA,AAB+AAAB,AA+AAB,AAB+AABB} ...]
Specify which of models should be included.
-v, --version Print version
Note that -m option depends on the contents of Models.py *Note that script is configured by default to reference chromosome names, with the exception of missing "chr" that will be automatically added. However, it can be easily adjusted to handle virtually any names of chromosomes, by simply editing CHROMS_ORDER list in Consts.py. "
The input file, -i, should contain tab separated allele counts. It is expected to contain at least five columns and a header:
- chromosome name
- genomic position
- count of reads with reference
- count of reads with non-reference
- Type of non-referencness. Only SNP are used.
The exact names of the columns can be specified in the config file, in the [InputColumns] section.
Allele counts can be relatively easily obtained by other software, created for that purpose.
Besides the above, script expects two more files with genomic data: cyto-bands and list of SuperGood markers. If no filtering (-f False) of input file if required, SuperGood markers are not used and can be omitted. Files names must be specified in [Input] section of the config file.
File is expected to contain five, tab delimited, columns:
- '#chrom': name of the chromosome
- chromStart: start position
- chromEnd: end position
- name: name of the cyto band; for example p36.33
- gieStain: intensity of the band, 'acen' is for centromere.
File is expected to contain two, tab delimited, columns:
- chromosome name
- genomic position
The names of those files can be altered by editing configuration file, the [Input] section.
Script can output 5 files: found CNA segments, processed data, detailed log, runs and all considered solutions (refer to wiki for detailed explanation).
Found CNA segments are reported in a format of BED-like file. First three describe genomic range (chromosome, start, end) followed by estimated parameters: allelic imbalance, number of windows within the segments, coverage, copy number, distance to diploid reference, score of diploidy, model, distance to model, model score, clonality, segment symbol, cytobands, centromere fraction.
File contains input count data, filtered through SG markers list, unless specified otherwise, with 3 additional columns: coverage, vaf and assigned symbol E, N, or U. Useful for papardelle plot.
Contents of this file is self-explanatory.
File reports all found solutions for each run. Intended as an aid in more in-depth analysis of complex cases and debugging.
Genome-wide parameters are dumped into this file. The meaning of the parameters is as follows:
fb - widening parameter
indexes
index
models - list of models used to kariotype the sample
thr_model - estimated threshold for model score
thr_HE - estimated threshold for score of diploidity /score_HE/