overlap assembly and coverage trimming to create VariantSequence candidates #57

Merged (11 commits) on Nov 8, 2016

@iskandr iskandr commented Oct 27, 2016

The primary change in this branch is that I added a variant_cdna_sequence_assembly option to the call chain which determines the cDNA sequence from reads (and a corresponding --variant-sequence-assembly option to the CLI). I also added coverage statistics and trimming by coverage to VariantSequence.
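To illustrate what trimming by coverage can look like, here is a minimal sketch. It is not isovar's implementation: the function names `per_base_coverage` and `trim_by_coverage`, the `min_coverage` parameter, and the assumption that each read is an exact substring of the assembled sequence are all hypothetical and chosen only for this example.

```python
def per_base_coverage(sequence, reads):
    """Count, for each position of `sequence`, how many reads cover it.

    Assumption for this sketch: each read occurs as an exact substring
    of `sequence`; reads that do not match are ignored.
    """
    coverage = [0] * len(sequence)
    for read in reads:
        start = sequence.find(read)
        if start < 0:
            continue  # read does not match the sequence
        for i in range(start, start + len(read)):
            coverage[i] += 1
    return coverage


def trim_by_coverage(sequence, coverage, min_coverage):
    """Drop leading and trailing bases supported by fewer than `min_coverage` reads."""
    start = 0
    end = len(sequence)
    while start < end and coverage[start] < min_coverage:
        start += 1
    while end > start and coverage[end - 1] < min_coverage:
        end -= 1
    return sequence[start:end]


sequence = "AACCGGTTAC"
reads = ["AACCGG", "CCGGTT", "CGGTTAC"]
coverage = per_base_coverage(sequence, reads)
trimmed = trim_by_coverage(sequence, coverage, min_coverage=2)
```

In this toy input the two bases at each end are supported by a single read, so they are the ones trimmed away when `min_coverage=2`.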

Less important changes:

  • Reorganized ReferenceContext by moving the namedtuples from which it inherits fields into separate modules and renaming them to ReferenceSequenceKey and ReferenceCodingSequenceKey.
  • Turned simple functions for constructing objects (e.g. allele read from locus read) into class methods.
  • Enabled caching of PyEnsembl on Travis.

TODO for a future branch:


@iskandr iskandr changed the title WIP: overlap assembly and coverage trimming to create VariantSequence candidates overlap assembly and coverage trimming to create VariantSequence candidates Nov 4, 2016

coveralls commented Nov 4, 2016

Coverage decreased (-0.5%) to 86.411% when pulling dcfd2ed on use-assembly-and-trimming into f60e654 on master.

@ihodes ihodes left a comment

LGTM, would love to go through this with you a bit, and @julia326 is going to look through this as well.

@@ -185,9 +178,9 @@ def iterative_assembly(

Can't comment on the actual line, but curious what the reasoning is behind a default min_overlap_size of 30?

Additionally, it should probably be set to MINIMUM_OVERLAP_SIZE?

Otherwise, I reviewed this entire module, and it looks good to me!

Contributor Author

@ihodes I'm worried about "false" assemblies in which reads overlap only on the variant nucleotides but actually come from different splice isoforms. Requiring some overlap beyond the variant nucleotides should reduce the chance of a false assembly, but the exact amount (30) is fairly arbitrary: shorter than a read length, yet long enough that a k-mer of that size is unlikely to match by chance.
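The effect of a minimum overlap can be sketched as follows. This is only an illustration of the idea, not the code in this branch: `merge_if_overlapping` and the module-level `MIN_OVERLAP_SIZE` constant are hypothetical names, and the sketch only handles exact suffix/prefix matches between two strings.

```python
# Hypothetical constant mirroring the default of 30 discussed in this thread.
MIN_OVERLAP_SIZE = 30


def merge_if_overlapping(left, right, min_overlap_size=MIN_OVERLAP_SIZE):
    """Merge `right` onto the end of `left` when a suffix of `left` equals a
    prefix of `right` and the shared region has at least `min_overlap_size` bases.

    Returns the merged string, or None if no sufficiently long overlap exists.
    """
    max_overlap = min(len(left), len(right))
    # Try the longest candidate overlap first and stop at the minimum size.
    for size in range(max_overlap, min_overlap_size - 1, -1):
        if size > 0 and left.endswith(right[:size]):
            return left + right[size:]
    return None


left = "AAAACCCCGGGG"
right = "CCCCGGGGTTTT"
merged = merge_if_overlapping(left, right, min_overlap_size=8)
rejected = merge_if_overlapping(left, right, min_overlap_size=9)
```

Here the two strings share only 8 bases, so they merge when the threshold is 8 and are kept separate when it is 9. With a threshold larger than the variant itself, two reads that agree only on the variant nucleotides would be rejected in the same way.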

@@ -72,3 +72,8 @@

# number of protein sequences we want to return per variant
MAX_PROTEIN_SEQUENCES_PER_VARIANT = 1

Should MIN_OVERLAPPING_READS go here as well? What makes it into here?

@iskandr iskandr merged commit 1dff463 into master Nov 8, 2016
@ihodes ihodes deleted the use-assembly-and-trimming branch January 25, 2017 22:10