Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75

Open
davmlaw opened this issue Oct 1, 2024 · 9 comments
Open
Labels
bug Something isn't working

Comments

@davmlaw
Copy link
Contributor

davmlaw commented Oct 1, 2024

https://asia.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000136250;r=7:36512941-36724494;t=ENST00000617537

Web page Reports that the sequence is 2385 bases long

Ensembl API is in agreement:

In [25]: import requests
In [26]: url = "https://rest.ensembl.org/sequence/id/ENST00000617537?type=cdna"
In [27]: r = requests.get(url, headers={"Content-Type": "application/json"}, timeout=60)
In [28]: len(r.json()["seq"])
Out[28]: 2385

SeqRepo returns much longer sequence:

In [20]: from hgvs.dataproviders.seqfetcher import SeqFetcher

In [21]: sf = SeqFetcher()

In [22]: seq = sf.fetch_seq("ENST00000617537.5")

In [23]: len(seq)
Out[23]: 211554
@davmlaw davmlaw added the bug Something isn't working label Oct 1, 2024
@davmlaw davmlaw changed the title Ensembl transcript sequence is wrong Ensembl transcript ENST00000617537.5 sequence is wrong Oct 1, 2024
@davmlaw davmlaw changed the title Ensembl transcript ENST00000617537.5 sequence is wrong Ensembl transcript ENST00000617537.5 sequence is wrong (88x larger than it should be) Oct 1, 2024
@reece
Copy link
Member

reece commented Oct 2, 2024

Well, that's very surprising. I'll need to investigate.

FWIW, UCSC appears to have fallen into the same issue:
image
(from https://genome.cse.ucsc.edu/cgi-bin/hgGene?hgg_gene=UC032ZJW.2, taken today)

@reece
Copy link
Member

reece commented Oct 2, 2024

Okay, found it. Not a bug. Ensembl itself returns this sequence:

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537' | jq -r .seq | wc  -c
211555

(+1 char for the newline)

Also, the code path that you're using is bioutils.seqfetcher, via hgvs dataproviders.

I double checked seqrepo and it's fine:

❯ seqrepo  -r /usr/local/share/seqrepo  start-shell -i latest 
Python 3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.27.0 -- An enhanced Interactive Python. Type '?' for help.


seqrepo (https://github.com/biocommons/biocommons.seqrepo/)
version: 0.6.9
instance path: /usr/local/share/seqrepo/latest

>>> q=sr["ENST00000617537.5"]
>>> len(q)
2385

>>> q = bioutils.seqfetcher.fetch_seq("ENST00000617537.5")
>>> len(q)
211554

@reece reece closed this as completed Oct 2, 2024
@davmlaw
Copy link
Contributor Author

davmlaw commented Oct 3, 2024

@reece the Ensembl REST API endpoint https://rest.ensembl.org/documentation/info/sequence_id

has an argument type which:

Defaults to genomic where applicable, i.e. not translations. cdna refers to the spliced transcript sequence with UTR; cds refers to the spliced transcript sequence without UTR.

If you look, I'm using type=cdna - while when you leave it off, you're using genomic - which explains the difference.

So should SeqRepo should use the cdna rather than genomic? That looks like how it's done with RefSeq?

@reece reece transferred this issue from biocommons/biocommons.seqrepo Oct 3, 2024
@reece
Copy link
Member

reece commented Oct 3, 2024

This issue is totally unrelated to SeqRepo. SeqFetcher is part of bioutils. I've just moved this issue from biocommons.seqrepo to bioutils.

The underlying cause is that an ENST id refers to a family of sequences, not a single sequence. The cDNA sequence is what I consider to be the main sequence, and that's what's in SeqRepo. It's also what's shown on Ensembl web pages for a given ENST. And it's what corresponds to RefSeq transcripts. It is very unfortunate that a single id refers to three sequences.

Honestly, this all makes me think that we should remove support for fetching ENSTs with SeqFetcher (here) since it can't be done unambiguously. It's better to not have this support than to try to chase Ensembl's intentions with identifiers. Alternatively, we could assume cdna and add type=cdna to the http get.

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537?type=cds' | jq '{id: .id, version: .version, seq_length: (.seq | length)}'
{
  "id": "ENST00000617537",
  "version": 5,
  "seq_length": 1728
}

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537?type=cdna' | jq '{id: .id, version: .version, seq_length: (.seq | length)}'
{
  "id": "ENST00000617537",
  "version": 5,
  "seq_length": 2385
}

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537' | jq '{id: .id, version: .version, seq_length: (.seq | length)}'     
{
  "id": "ENST00000617537",
  "version": 5,
  "seq_length": 211554
}

@davmlaw
Copy link
Contributor Author

davmlaw commented Oct 3, 2024

OK with you moving issue to wherever you think is best.

I've re-opened it as I think it's an open question on what to do regarding ENST's - hope that's ok

@davmlaw davmlaw reopened this Oct 3, 2024
@davmlaw davmlaw changed the title Ensembl transcript ENST00000617537.5 sequence is wrong (88x larger than it should be) Ensembl transcript ENST00000617537.5 sequence is genomic not cdna Oct 3, 2024
@reece
Copy link
Member

reece commented Oct 4, 2024

I'm fine with you reopening it if you think it needs more discussion.

The whole point of seqfetcher is to create a simple interface for fetching a sequence for a given identifier (and just the identifier). Identifiers by definition should map to only one value. ENSTs are not actually identifiers at all in this sense.

So, my hot take is that we should just remove support for ENSTs from seqfetcher.

If we're going to leave this open, would you please suggest what outcome you would like to see from this issue?

@jsstevenson
Copy link
Contributor

Just pitching in: I'd be a little bummed if we lost support for Ensembl transcripts. Could we just assume cDNA and log a warning indicating as much every time an ENST is requested? That way, behavior is at least consistent and somewhat transparent.

@reece
Copy link
Member

reece commented Oct 4, 2024

Yep we could assume cDNA. That's definitely the best solution.

@davmlaw
Copy link
Contributor Author

davmlaw commented Oct 5, 2024

I also vote for cdna by default.

My main goal is I want Ensembl HGVS to resolve correctly, or failing that, not at all (better to not support)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants