Exon boundary / intron validation - need for genome build specific validation? #700
Labels
data provider schema change
enhancement
New feature or request
keep alive
exempt issue from staleness checks
Comments
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
github-actions (bot) added the stale label (Issue is stale and subject to automatic closing) on Nov 29, 2023.
This issue was closed because it has been stalled for 7 days with no activity.
reece removed the stale and closed-by-stale labels on Dec 8, 2023.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
github-actions (bot) added the stale label (Issue is stale and subject to automatic closing) on Mar 11, 2024.
This issue was closed because it has been stalled for 7 days with no activity.
jsstevenson added the enhancement and keep alive labels and removed the stale and closed-by-stale labels on Mar 19, 2024.
Biocommons HGVS currently does not validate whether an intronic coordinate is valid, i.e. whether it uses a real exon boundary and actually falls inside an intron.
The trouble is that this validation needs the transcript's strandedness, which HGVS does not have access to until it knows the genome build. This means you can't do the validation in the obvious place, ExtrinsicValidator, probably on the "var_n" (sequence variant of type "n").
For instance, you could provide a wrong exon boundary. The HGVS Spec on numbering says:
If offset is positive, exon boundary should be in stranded ends
If offset is negative, exon boundary should be in stranded starts
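The two rules above can be sketched as a small check. This is only an illustration, not part of Biocommons HGVS: the function name `boundary_matches_offset` and the toy exon layout are hypothetical, and the exon starts/ends are assumed to be already expressed in transcript orientation (i.e. already "stranded").

```python
def boundary_matches_offset(base, offset, exon_starts, exon_ends):
    """Check that an intronic position `base` + `offset` anchors on the right kind of boundary.

    base        -- transcript position of the exon base the offset is counted from
    offset      -- signed intronic offset (+N counts on from an exon end, -N counts back from an exon start)
    exon_starts -- first base of each exon, in transcript orientation (assumed input)
    exon_ends   -- last base of each exon, in transcript orientation (assumed input)
    """
    if offset == 0:
        # Not an intronic position; this check has nothing to say about it.
        return True
    if offset > 0:
        # A positive offset must be anchored on the last base of an exon.
        return base in set(exon_ends)
    # A negative offset must be anchored on the first base of an exon.
    return base in set(exon_starts)


# Hypothetical three-exon transcript, positions in transcript coordinates.
toy_starts = [1, 101, 251]
toy_ends = [100, 250, 400]

print(boundary_matches_offset(100, 5, toy_starts, toy_ends))   # anchored on an exon end
print(boundary_matches_offset(100, -5, toy_starts, toy_ends))  # sign reversed on the same base
```

The check itself is trivial; the hard part, as discussed in this issue, is obtaining the stranded starts and ends in the first place.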
Example 1 (no error w/Biocommons HGVS) - correct boundary is 228, I provide the wrong exon boundary:
ClinGen gives the same error message if the exon boundary is wrong, even if you would be inside the intron (e.g. NM_152587.3:c.227+5A>T).
Example 2 (no error w/Biocommons HGVS) - I reverse the offset (from "-" to "+"), leaving the boundary as is:
VariantValidator looks at the correct exon boundary for the strandedness (i.e. starts or ends), so even if you use a valid exon boundary (just with the sign reversed) it gives the same error.
Notes on validation implementation
Key issue: to know the correct exon starts/exon ends, you need to know the transcript's strandedness.
It's often easier to work with sequence variants of type "n", as their boundaries correspond to transcript exon starts/ends, e.g.:
But how do you map upstream/downstream to exon starts/ends? You need to know the strand, and this is NOT provided by any data provider methods that don't take a genome build / contig.
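To illustrate why strand matters here, the sketch below turns genomic exon coordinates into transcript-ordered (first base, last base) pairs. It is a minimal sketch under assumptions: `stranded_boundaries` is a hypothetical helper, the input is assumed to be 1-based inclusive genomic intervals, and the strand is assumed to come from a data provider call that takes a contig / genome build.

```python
def stranded_boundaries(genomic_exons, strand):
    """Return exons in transcript order as (first_base, last_base) genomic positions.

    genomic_exons -- list of (start, end) genomic intervals with start <= end (assumed 1-based inclusive)
    strand        -- +1 for the forward strand, -1 for the reverse strand
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    ordered = sorted(genomic_exons)
    if strand == 1:
        # Forward strand: genomic starts are the transcript's exon starts.
        return ordered
    # Reverse strand: the transcript runs right to left, so each exon's first
    # transcribed base is its genomic end, and the exon order is reversed.
    return [(end, start) for start, end in reversed(ordered)]


toy_exons = [(100, 200), (300, 400)]  # hypothetical coordinates
print(stranded_boundaries(toy_exons, 1))
print(stranded_boundaries(toy_exons, -1))
```

On the reverse strand the same genomic interval contributes its end as the "stranded start", which is why the boundary check cannot be done from exon intervals alone.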
Valid exon boundaries that map outside the transcript
To work either of these out, you need to know how big the introns are, which you can only get via data provider methods that take a contig/genome build.
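As a rough sketch of the size check, intron lengths can be derived from genomic exon intervals and compared with the offset. Both helper names are hypothetical, the intervals are assumed to be 1-based inclusive, and the lengths are returned in genomic order (reverse the list for a reverse-strand transcript).

```python
def intron_lengths(genomic_exons):
    """Lengths of the gaps between consecutive exons, in genomic order.

    genomic_exons -- list of (start, end) genomic intervals, assumed 1-based inclusive
    """
    ordered = sorted(genomic_exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def offset_fits_intron(offset, intron_length):
    """True if a non-zero offset stays within an intron of the given length."""
    return 0 < abs(offset) <= intron_length


toy_exons = [(100, 200), (300, 400)]  # hypothetical coordinates
lengths = intron_lengths(toy_exons)
print(lengths)
print(offset_fits_intron(5, lengths[0]), offset_fits_intron(150, lengths[0]))
```

This only tests that the offset does not run past the far end of the intron; it does not enforce any convention about which of the two flanking exons the position should be anchored to.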
Offsets of 0 are prohibited
This is a low priority issue, and probably doesn't hurt much to leave it.
But technically, NM_152587.3:c.228-0= is invalid.
VariantValidator doesn't throw an error, but the ClinGen Allele Registry throws "HgvsParsingError - Cannot parse definition of mutation".
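Unlike the boundary checks, a zero offset can be detected from the string alone, without strand or genome build. The snippet below is a minimal sketch using a plain regular expression rather than the Biocommons HGVS parser; `has_zero_offset` is a hypothetical name, and the pattern only looks for a "+0" or "-0" directly after a position digit.

```python
import re

# A position digit, a sign, then one or more zeros not followed by another digit.
_ZERO_OFFSET = re.compile(r"\d[+-]0+(?!\d)")


def has_zero_offset(hgvs_string):
    """Return True if the variant description contains an intronic offset of 0."""
    # Only inspect the part after the accession, e.g. "c.228-0=".
    posedit = hgvs_string.rsplit(":", 1)[-1]
    return _ZERO_OFFSET.search(posedit) is not None


print(has_zero_offset("NM_152587.3:c.228-0="))
print(has_zero_offset("NM_152587.3:c.227+5A>T"))
```

In a real validator this would more naturally be a check on the parsed position's offset than on the raw string.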