-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overlap assembly and coverage trimming to create VariantSequence candidates #57
Conversation
…evels of coverage
7196c6d
to
d13e68d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, would love to go through this with you a bit, and @julia326 is going to look through this as well.
@@ -185,9 +178,9 @@ def iterative_assembly( | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't comment on the actual line, but curious what the reasoning is behind a default min_overlap_size
of 30?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, it should probably be set to MINIMUM_OVERLAP_SIZE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, I reviewed this entire module, and it looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihodes I'm worried about "false" assemblies that only overlap on the variant nucleotides (but actually come from different splice isoforms). It seems that requiring some amount of overlap beyond the variant nucleotides will decrease the chances of false assembly but the exact amount (30) is pretty arbitrarily chosen: shorter than a read length but long enough to be a kmer unlikely to happen by chance.
@@ -72,3 +72,8 @@ | |||
|
|||
# number of protein sequences we want to return per variant | |||
MAX_PROTEIN_SEQUENCES_PER_VARIANT = 1 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should MIN_OVERLAPPING_READS
go here as well? What makes it into here?
The primary change in this branch is that I added a
variant_cdna_sequence_assembly
option to the call chain which determines the cDNA sequence from reads (and a corresponding--variant-sequence-assembly
option to the CLI). I also added coverage statistics and trimming by coverage toVariantSequence
.Less important changes:
ReferenceContext
by moving the namedtuples from which it inherits fields into separate modules and renaming them toReferenceSequenceKey
andReferenceCodingSequenceKey
.TODO for a future branch:
If we can't establish a
ReferenceContext
for a variant sequence with only 1 reading supporting the very beginning of its sequence, try trimming to the next highest coverage level(s) until a reference context can be established. (Trim VariantSequences by increasing coverage while trying to figure out reading frame #58)merge overlapping paired reads into single
LocusRead
objects (extends the length of sequences we can construct). (Merge overlapping read pairs into single LocusRead #59)This change is