Add region input queries #97

ahc87 · 2022-10-20T14:06:49Z

https://www.ebi.ac.uk/panda/jira/browse/EA-1016
https://www.ebi.ac.uk/panda/jira/browse/EA-1018
https://www.ebi.ac.uk/panda/jira/browse/EA-1017

This PR adds region input queries, associated exceptions, and tests.

Mypy, Pylint, and Pytest all pass.

This is a draft because we cannot push it until we have updated the loading scripts populate regions with genome_ids.

azangru · 2022-10-21T08:31:12Z

common/schemas/query.graphql

@@ -16,6 +16,8 @@ type Query {
                 start: Int @deprecated(reason: "Use `by_slice`"),
                 end: Int @deprecated(reason: "Use `by_slice`")
                 by_slice: SliceInput): Locus
+  region(by_name: RegionNameInput!): Region
+  regions(by_genome_id: GenomeIdInput!): [Region]


What I realised we are missing from the CDM — if we allow queries like this — is a field with some kind of ranking information in Region.

Consider an example in which you are requesting all human chromosomes. You will likely want to display them in a specific order. There is no way for the client to know, for example, that human chromosome X goes after human chromosome 22, unless someone has manually curated the list of top-level regions, and the backend tells the client this order.

cc: @bethflint

Is that a CDM thing or a thing we want to support in our GraphQL service which is not part of CDM? And where does this data currently reside?

It looks like a CDM thing to me. One needs to somehow represent the information that the first chromosome of Saccharomyces cerevisiae is chromosome I, or that human karyotype ends in chromosomes 22, X, Y, MT, in that order.

And where does this data currently reside?

🤷‍♂️

do you know @ens-st3 ?

MySQL [homo_sapiens_core_108_38]> select sr.name, sra.value from attrib_type at, seq_region_attrib sra, seq_region sr where at.attrib_type_id = sra.attrib_type_id and sra.seq_region_id = sr.seq_region_id and at.code = 'karyotype_rank' order by sra.value;
+------+-------+
| name | value |
+------+-------+
| 1 | 1 |
| 10 | 10 |
| 11 | 11 |
| 12 | 12 |
| 13 | 13 |
| 14 | 14 |
| 15 | 15 |
| 16 | 16 |
| 17 | 17 |
| 18 | 18 |
| 19 | 19 |
| 2 | 2 |
| 20 | 20 |
| 21 | 21 |
| 22 | 22 |
| X | 23 |
| Y | 24 |
| MT | 25 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
+------+-------+
25 rows in set (0.003 sec)

azangru · 2022-10-21T09:09:28Z

graphql_service/resolver/gene_model.py

+    query = {
+        "type": "Region",
+        "genome_id": by_genome_id["genome_id"],
+        "code": "chromosome",  # We filter on chromosomes to avoid returning an overwhelming number of regions


Oh, do you have an overwhelming number of regions? Then this is likely another of our "long lists of things" cases, and we should give the client an option to request different kinds of regions. Possibly making it an optional query parameter, in the absence of which Thoas will respond with top-level regions.

By the way, not all organisms in our databases will have top-level regions that are chromosomes. @ens-st3 has chosen an organism (Chacoan peccary, I believe) whose genome has only been sequenced to a very provisional level (primary assemblies, is it?) for inclusion in Thoas as part of the next batch of species, to confirm that we are able to handle organisms like that. The code above won't be able to find top-level regions for such organisms.

This was because it was mentioned that there would potentially be millions of regions for some assemblies. I think a decision needs to be made about how we handle pagination before we can handle those very large numbers.

Do we know how many non-chromosomal regions the assemblies we know we're going to be handling in the short and mid term are going to have?

There's an attribute in the "attrib_type" table called "Top level" with description "Top Level Non-Redundant Sequence Region" which looks like it could be a better candidate to filter on than "chromosome"

MySQL [catagonus_wagneri_core_108_2]> select count() from seq_region sr, seq_region_attrib sra, attrib_type at where sr.seq_region_id = sra.seq_region_id and sra.attrib_type_id = at.attrib_type_id and at.code = 'toplevel';
+----------+
| count() |
+----------+
| 1183730 |
+----------+

Note I've added it to sta-6 just in case we choose to support these now. If there are similar genomes in rapid, which is prob unlikely then we will need to, but if not it's our choice.
Another 'interesting' example is very long top level sequences, eg
MySQL [monodelphis_domestica_core_108_1]> select max(sr.length) from seq_region sr, seq_region_attrib sra, attrib_type at where sr.seq_region_id = sra.seq_region_id and sra.attrib_type_id = at.attrib_type_id and at.code = 'toplevel';
+----------------+
| max(sr.length) |
+----------------+
| 748055161 |
+----------------+

Steve's first query shows that filtering on "toplevel" wouldn't product us from getting back an overwhelming number of regions, right?

It'll exclude the constituent parts of assemblies (contigs, clones etc) which will help massively. There should never be any annotation on these - they should only ever be on toplevel regions - and I suggest there's no reason for them to be in Thoas.
Is 1.2M toplevel sequences overwhelming ?

I'm not sure if 1.2M would be overwhelming. Users can already overwhelm the service if they want to by sending very deep queries. I agree with @bethflint , I think we should make a decision on pagination. That would give us a way to limit queries, and then we won't have to worry about them being too heavy.

Add region input queries

7b338d0

ahc87 requested review from azangru and sanjay-boddu October 20, 2022 14:06

ahc87 marked this pull request as draft October 20, 2022 15:40

azangru reviewed Oct 21, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add region input queries #97

Add region input queries #97

ahc87 commented Oct 20, 2022 •

edited

Loading

azangru Oct 21, 2022 •

edited

Loading

bethflint Oct 21, 2022

azangru Oct 21, 2022 •

edited

Loading

ens-st3 Oct 21, 2022

azangru Oct 21, 2022 •

edited

Loading

bethflint Oct 21, 2022

ahc87 Oct 21, 2022

ens-st3 Oct 21, 2022 •

edited

Loading

ens-st3 Oct 21, 2022

ahc87 Oct 21, 2022

ens-st3 Oct 21, 2022

ahc87 Oct 21, 2022

Add region input queries #97

Are you sure you want to change the base?

Add region input queries #97

Conversation

ahc87 commented Oct 20, 2022 • edited Loading

azangru Oct 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

azangru Oct 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

azangru Oct 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ens-st3 Oct 21, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahc87 commented Oct 20, 2022 •

edited

Loading

azangru Oct 21, 2022 •

edited

Loading

azangru Oct 21, 2022 •

edited

Loading

azangru Oct 21, 2022 •

edited

Loading

ens-st3 Oct 21, 2022 •

edited

Loading