overlap assembly and coverage trimming to create VariantSequence candidates #57

iskandr · 2016-10-27T23:43:27Z

The primary change in this branch is a new variant_cdna_sequence_assembly option in the call chain, which determines the cDNA sequence from reads (with a corresponding --variant-sequence-assembly option in the CLI). I also added coverage statistics and trimming by coverage to VariantSequence.

Less important changes:

Reorganized ReferenceContext by moving the namedtuples from which it inherits fields into separate modules and renaming them to ReferenceSequenceKey and ReferenceCodingSequenceKey.
Turned simple functions for constructing objects (e.g. allele read from locus read) into class methods.
Enabled caching of PyEnsembl on Travis.

TODO for a future branch:

If we can't establish a ReferenceContext for a variant sequence with only 1 read supporting the very beginning of its sequence, try trimming to the next highest coverage level(s) until a reference context can be established. (Trim VariantSequences by increasing coverage while trying to figure out reading frame #58)
Merge overlapping paired reads into single LocusRead objects, which extends the length of the sequences we can construct. (Merge overlapping read pairs into single LocusRead #59)
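The first TODO item can be illustrated with a minimal sketch. It assumes a sequence is stored alongside a per-position list of supporting read counts; the function names `trim_by_coverage` and `trim_until_acceptable`, and the `is_acceptable` callback (a stand-in for "a reference context can be established"), are hypothetical and not taken from the isovar code.

```python
def trim_by_coverage(sequence, coverage, min_reads):
    """Trim both ends of `sequence` until the outermost remaining
    positions are supported by at least `min_reads` reads.

    Hypothetical helper: `coverage[i]` is the number of reads
    supporting `sequence[i]`.
    """
    keep = [i for i, count in enumerate(coverage) if count >= min_reads]
    if not keep:
        return "", []
    start, end = keep[0], keep[-1] + 1
    return sequence[start:end], coverage[start:end]


def trim_until_acceptable(sequence, coverage, is_acceptable):
    """Try successively higher coverage levels and return the first
    (trimmed sequence, level) pair that `is_acceptable` approves,
    or None if no level works.
    """
    for level in sorted(set(coverage)):
        trimmed, _ = trim_by_coverage(sequence, coverage, level)
        if trimmed and is_acceptable(trimmed):
            return trimmed, level
    return None
```

Iterating over the distinct coverage values in ascending order means each step removes only the least-supported flanking bases, so the longest acceptable sequence is found first.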

coveralls · 2016-11-04T22:37:53Z

Coverage decreased (-0.5%) to 86.411% when pulling dcfd2ed on use-assembly-and-trimming into f60e654 on master.

ihodes

LGTM, would love to go through this with you a bit, and @julia326 is going to look through this as well.

ihodes · 2016-11-07T18:01:52Z

isovar/assembly.py

@@ -185,9 +178,9 @@ def iterative_assembly(



Can't comment on the actual line, but curious what the reasoning is behind a default min_overlap_size of 30?

Additionally, it should probably be set to MINIMUM_OVERLAP_SIZE?

Otherwise, I reviewed this entire module, and it looks good to me!

iskandr · 2016-11-08

@ihodes I'm worried about "false" assemblies of reads that overlap only on the variant nucleotides but actually come from different splice isoforms. Requiring some overlap beyond the variant nucleotides should reduce the chance of false assembly, but the exact amount (30) is fairly arbitrary: shorter than a read length, yet long enough that a matching k-mer is unlikely to occur by chance.
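The effect of a minimum overlap threshold can be shown with a simplified sketch. This is not the implementation in isovar/assembly.py: the function `merge_if_overlapping` is hypothetical and only handles an exact suffix/prefix match between two plain strings, ignoring variant positions entirely.

```python
def merge_if_overlapping(left, right, min_overlap_size=30):
    """Merge `right` onto the end of `left` if a suffix of `left`
    exactly matches a prefix of `right` over at least
    `min_overlap_size` characters. Returns None otherwise.

    Hypothetical helper for illustration only.
    """
    max_size = min(len(left), len(right))
    # Prefer the longest overlap; never accept one below the threshold.
    for size in range(max_size, min_overlap_size - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None
```

With a low threshold, two sequences that share only a few bases get merged; raising the threshold rejects those merges, at the cost of leaving short but genuine overlaps unassembled.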

ihodes · 2016-11-07T18:30:58Z

isovar/default_parameters.py

@@ -72,3 +72,8 @@

 # number of protein sequences we want to return per variant
 MAX_PROTEIN_SEQUENCES_PER_VARIANT = 1
+


Should MIN_OVERLAPPING_READS go here as well? What makes it into here?

assemble reads and then trimmed assembled sequences down to desired levels of coverage

d13e68d

iskandr force-pushed the use-assembly-and-trimming branch from 7196c6d to d13e68d Compare October 27, 2016 23:45

iskandr added 9 commits October 27, 2016 21:18

testing with and without read overlap assembly

0e653dd

got unit tests with assembly passing, working on adding more

b80d3d4

splitting reference_context into three modules

a8ae160

fixed unit tests with ReferenceSequenceKey and ReferenceCodingSequenceKey

d48b88e

added reference coding sequence module and tests

1c88072

renamed CLI arg for assembly

78b4180

added caching of pyensembl

e14e019

version bump due to large change in logic

881aa04

assembly unit test with mock reads

bb757cb

iskandr changed the title ~~WIP: overlap assembly and coverage trimming to create VariantSequence candidates~~ overlap assembly and coverage trimming to create VariantSequence candidates Nov 4, 2016

fixed name of reads arg

dcfd2ed

ihodes reviewed Nov 7, 2016

iskandr merged commit 1dff463 into master Nov 8, 2016

ihodes deleted the use-assembly-and-trimming branch January 25, 2017 22:10
