Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add region input queries #97

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Add region input queries #97

wants to merge 1 commit into from

Conversation

ahc87
Copy link

@ahc87 ahc87 commented Oct 20, 2022

https://www.ebi.ac.uk/panda/jira/browse/EA-1016
https://www.ebi.ac.uk/panda/jira/browse/EA-1018
https://www.ebi.ac.uk/panda/jira/browse/EA-1017

This PR adds region input queries, associated exceptions, and tests.

Mypy, Pylint, and Pytest all pass.

This is a draft because we cannot push it until we have updated the loading scripts populate regions with genome_ids.

@ahc87 ahc87 marked this pull request as draft October 20, 2022 15:40
@@ -16,6 +16,8 @@ type Query {
start: Int @deprecated(reason: "Use `by_slice`"),
end: Int @deprecated(reason: "Use `by_slice`")
by_slice: SliceInput): Locus
region(by_name: RegionNameInput!): Region
regions(by_genome_id: GenomeIdInput!): [Region]
Copy link
Collaborator

@azangru azangru Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I realised we are missing from the CDM — if we allow queries like this — is a field with some kind of ranking information in Region.

Consider an example in which you are requesting all human chromosomes. You will likely want to display them in a specific order. There is no way for the client to know, for example, that human chromosome X goes after human chromosome 22, unless someone has manually curated the list of top-level regions, and the backend tells the client this order.

cc: @bethflint

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a CDM thing or a thing we want to support in our GraphQL service which is not part of CDM? And where does this data currently reside?

Copy link
Collaborator

@azangru azangru Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a CDM thing to me. One needs to somehow represent the information that the first chromosome of Saccharomyces cerevisiae is chromosome I, or that human karyotype ends in chromosomes 22, X, Y, MT, in that order.

And where does this data currently reside?

🤷‍♂️

do you know @ens-st3 ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL [homo_sapiens_core_108_38]> select sr.name, sra.value from attrib_type at, seq_region_attrib sra, seq_region sr where at.attrib_type_id = sra.attrib_type_id and sra.seq_region_id = sr.seq_region_id and at.code = 'karyotype_rank' order by sra.value;
+------+-------+
| name | value |
+------+-------+
| 1 | 1 |
| 10 | 10 |
| 11 | 11 |
| 12 | 12 |
| 13 | 13 |
| 14 | 14 |
| 15 | 15 |
| 16 | 16 |
| 17 | 17 |
| 18 | 18 |
| 19 | 19 |
| 2 | 2 |
| 20 | 20 |
| 21 | 21 |
| 22 | 22 |
| X | 23 |
| Y | 24 |
| MT | 25 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
+------+-------+
25 rows in set (0.003 sec)

query = {
"type": "Region",
"genome_id": by_genome_id["genome_id"],
"code": "chromosome", # We filter on chromosomes to avoid returning an overwhelming number of regions
Copy link
Collaborator

@azangru azangru Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, do you have an overwhelming number of regions? Then this is likely another of our "long lists of things" cases, and we should give the client an option to request different kinds of regions. Possibly making it an optional query parameter, in the absence of which Thoas will respond with top-level regions.

By the way, not all organisms in our databases will have top-level regions that are chromosomes. @ens-st3 has chosen an organism (Chacoan peccary, I believe) whose genome has only been sequenced to a very provisional level (primary assemblies, is it?) for inclusion in Thoas as part of the next batch of species, to confirm that we are able to handle organisms like that. The code above won't be able to find top-level regions for such organisms.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because it was mentioned that there would potentially be millions of regions for some assemblies. I think a decision needs to be made about how we handle pagination before we can handle those very large numbers.

Do we know how many non-chromosomal regions the assemblies we know we're going to be handling in the short and mid term are going to have?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an attribute in the "attrib_type" table called "Top level" with description "Top Level Non-Redundant Sequence Region" which looks like it could be a better candidate to filter on than "chromosome"

Copy link

@ens-st3 ens-st3 Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MySQL [catagonus_wagneri_core_108_2]> select count() from seq_region sr, seq_region_attrib sra, attrib_type at where sr.seq_region_id = sra.seq_region_id and sra.attrib_type_id = at.attrib_type_id and at.code = 'toplevel';
+----------+
| count(
) |
+----------+
| 1183730 |
+----------+

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note I've added it to sta-6 just in case we choose to support these now. If there are similar genomes in rapid, which is prob unlikely then we will need to, but if not it's our choice.
Another 'interesting' example is very long top level sequences, eg
MySQL [monodelphis_domestica_core_108_1]> select max(sr.length) from seq_region sr, seq_region_attrib sra, attrib_type at where sr.seq_region_id = sra.seq_region_id and sra.attrib_type_id = at.attrib_type_id and at.code = 'toplevel';
+----------------+
| max(sr.length) |
+----------------+
| 748055161 |
+----------------+

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steve's first query shows that filtering on "toplevel" wouldn't product us from getting back an overwhelming number of regions, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll exclude the constituent parts of assemblies (contigs, clones etc) which will help massively. There should never be any annotation on these - they should only ever be on toplevel regions - and I suggest there's no reason for them to be in Thoas.
Is 1.2M toplevel sequences overwhelming ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if 1.2M would be overwhelming. Users can already overwhelm the service if they want to by sending very deep queries. I agree with @bethflint , I think we should make a decision on pagination. That would give us a way to limit queries, and then we won't have to worry about them being too heavy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants