To set up BLR for development, it is beneficial to install it in editable mode. Use the command below:
pip install -e .[dev]
This also installs development tools such as pytest and flake8.
Test data needs to be downloaded and set up before running the tests. Copy and run the block below.
testdata_version=0.6
wget -nv https://export.uppmax.uu.se/uppstore2018173/blr-testdata-${testdata_version}.tar.gz
tar xf blr-testdata-${testdata_version}.tar.gz
ln -s blr-testdata-${testdata_version} blr-testdata
Now that everything is set up, the tests can be run using the tests/run.sh script.
bash tests/run.sh
flake8 is used for linting. Either run the command below before committing, or set up pre-commit to run it automatically.
flake8 src tests
There are two types of files in this repository that describe conda environments.
- The file environment.yml contains abstract dependencies such as pysam or bowtie2. This file is managed manually and needs to be updated whenever there are new dependencies or when the required version of a dependency changes.
- The environment.linux-64.lock and environment.osx-64.lock files (lock files) contain explicitly defined environments that are reproducible and platform-dependent. These files are used to create the test environments.
Whenever the environment.yml file is updated, you need to run:
conda-lock -f environment.yml -p linux-64 -p osx-64 --filename-template "environment.{platform}.lock"
to generate the environment.{platform}.lock files.
Install conda-lock using pip:
pip install conda-lock
or conda:
conda install conda-lock -c conda-forge
The table below specifies the SAM tags used for holding information during data processing and the argparse option flags to use when specifying them in Python scripts. The 10x Genomics barcoded BAM format is followed for the tags that it defines.
SAM-tag | Option flag | Description |
---|---|---|
BX | -b, --barcode-tag | String for the error-corrected barcode |
MI | -m, --molecule-tag | Integer index for an identified molecule for each barcode [*] |
RX | -s, --sequence-tag | String for the uncorrected barcode sequence |
HP | | Integer (1 or 2) for the read haplotype assigned |
PS | | Integer for the phase set (phaseblock) that the read is part of |
PC | | Integer for the quality of the phase set (phaseblock) |

[*] Note that the index is only unique within the particular chunk.
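
As an illustration of the option flags in the table, the following is a minimal sketch of how a script might expose them with argparse. The function name `build_parser`, the help texts, and the use of the SAM-tag names as default values are assumptions made for this example, not taken from the BLR source.

```python
import argparse


def build_parser():
    """Build a parser exposing the three SAM-tag option flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Hypothetical example script")
    # Defaults below are assumed to match the SAM-tag column of the table.
    parser.add_argument(
        "-b", "--barcode-tag", default="BX",
        help="SAM tag for the error-corrected barcode. Default: %(default)s")
    parser.add_argument(
        "-m", "--molecule-tag", default="MI",
        help="SAM tag for the molecule index. Default: %(default)s")
    parser.add_argument(
        "-s", "--sequence-tag", default="RX",
        help="SAM tag for the uncorrected barcode sequence. Default: %(default)s")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["-b", "BX", "--molecule-tag", "MI"])
print(args.barcode_tag, args.molecule_tag, args.sequence_tag)
```

argparse converts the dashes in the long option names to underscores, so `--barcode-tag` is read back as `args.barcode_tag`.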
To run profiling on a particular subcommand, you can use the --profile argument. For example, with the subcommand tagbam the command is:
blr --profile tagbam input.bam -o output.bam
This command generates a file called blr_tagbam.prof containing all the profiling information. The file can then be inspected with Python's standard library module pstats or, for example, with SnakeViz, which allows interactive exploration in the browser.
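
A minimal sketch of inspecting such a file with pstats is shown below. To keep the example self-contained, it first profiles a small stand-in workload with cProfile and writes it to a temporary file. The helper name `summarize_profile` and the file name example.prof are hypothetical, and the sketch assumes the .prof file is in the standard cProfile format.

```python
import cProfile
import os
import pstats
import tempfile


def summarize_profile(path, limit=10):
    """Load a profile file, print the top entries by cumulative time, return the stats."""
    stats = pstats.Stats(path)
    stats.strip_dirs().sort_stats("cumulative")
    stats.print_stats(limit)
    return stats


# Stand-in for a file such as blr_tagbam.prof: profile a small workload.
with tempfile.TemporaryDirectory() as tmpdir:
    prof_path = os.path.join(tmpdir, "example.prof")
    cProfile.run("sum(i * i for i in range(10000))", prof_path)
    stats = summarize_profile(prof_path, limit=5)
```

For a real run, the same helper would be called on the generated file, for example `summarize_profile("blr_tagbam.prof")`.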
Chunks are the separate portions of the mapped BAM that go through postprocessing. Each chunk contains one or more contigs. Chunks are handled in the Snakemake workflow through a dictionary called chunks, which groups contigs into different sets. Each set contains a list of chunks, which in turn contain lists of the contigs composing each chunk. Chunks are referred to by the name of their first contig. The three primary sets are accessed by the following keys:
- 'all' = every contig in the reference.
- 'primary' = every contig in the reference that should go through certain post-processing steps (see below). Is a subset of 'all'.
- 'phased' = every contig in the reference that is diploid, i.e. can be phased. Is a subset of 'primary'.
Several subsets of these are also defined for convenience.
'not_phased' = 'all' - 'phased'
'not_primary' = 'all' - 'primary'
'primary_not_phased' = 'primary' - 'phased'
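
The structure and the derived subsets can be sketched as follows. The contig names and the helper `difference` are hypothetical, and the sketch assumes that a chunk appears as an identical list in every set it belongs to, which may not hold in the actual workflow.

```python
# Hypothetical contig names; each inner list is one chunk of contigs.
chunks = {
    "all": [["chr1"], ["chr2"], ["chrX"], ["chrM", "chrUn_1"]],
    "primary": [["chr1"], ["chr2"], ["chrX"]],
    "phased": [["chr1"], ["chr2"]],
}


def difference(first, second):
    """Return the chunks in `first` that are not in `second`, keeping order."""
    return [chunk for chunk in first if chunk not in second]


chunks["not_phased"] = difference(chunks["all"], chunks["phased"])
chunks["not_primary"] = difference(chunks["all"], chunks["primary"])
chunks["primary_not_phased"] = difference(chunks["primary"], chunks["phased"])

# Chunks are referred to by the name of their first contig.
chunk_names = {key: [chunk[0] for chunk in value] for key, value in chunks.items()}
print(chunk_names["not_primary"])
```

Lists are used instead of Python sets here because the chunks are themselves lists, which are not hashable.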
These sets are used to control which contigs go through which processing steps. Which contigs are included is defined through the phasing_contigs (for 'phased') and contigs_skipped (for 'not_primary') parameters in the config file blr.yml.
Processing steps run for 'primary' contigs but not for every contig in 'all':
- find_clusterdups
- get_barcode_merges
- concat_molecule_stats
- get_barcodes_to_filter
- call_variants
- lsv_calling
Processing steps run for 'phased' contigs but not for every contig in 'primary':
- hapcut2_extracthairs
- hapcut2_linkfragments
- hapcut2_phasing
- build_config