
Synonym lookup super slow? How to fix? #2367

Open · edeutsch opened this issue Sep 7, 2024 · 4 comments

Comments

@edeutsch (Collaborator) commented Sep 7, 2024

I've noticed this for a while, but am only posting now. Has anyone else noticed that synonym lookup through the ARAX GUI is super slow? Try searching for metformin or ibuprofen or anything reasonably common: my CPU fans start groaning and it takes 15+ seconds for something to appear. I assume this is either because so much data is returned, or because rendering the graph is so expensive, or something else? Does anyone have ideas on how best to solve it? Return less data? Don't render the graph unless asked? This service was great when answers came back within a second, but now it's painful to use.

ideas?

@amykglen (Member) commented Sep 8, 2024

yes, this started happening after we started using the SRI Node Normalizer's drug_chemical_conflate parameter, which made the clusters for certain drugs really big.

I definitely think it's the 'match graph' that's causing the issue (I think the acetaminophen graph has 10s of thousands of edges now) - I wonder if we could just not display the graph if it has more than some reasonable number of edges? not sure if there's an existing way to determine the number of edges without actually having to load all of them..
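
for concreteness, here's a rough sketch of the kind of threshold check I mean, assuming the cluster's match graph comes back as a knowledge_graph dict with an "edges" map - the cutoff and function name are just placeholders, and this only helps after the response has already been loaded:

# Rough sketch (placeholder names): skip rendering the Concept Graph when
# the cluster's match graph is too large to draw responsively.
MAX_RENDERABLE_EDGES = 5000  # arbitrary cutoff; would need tuning

def should_render_match_graph(cluster: dict) -> bool:
    """Return True if the cluster's match graph is small enough to display."""
    edges = cluster.get("knowledge_graph", {}).get("edges", {})
    return len(edges) <= MAX_RENDERABLE_EDGES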

@isbluis (Member) commented Sep 17, 2024

As a quick test in devLM, looking up metformin results in the following rough timings:

  • 13 seconds to receive the JSON response (>77 MB)
  • 35 seconds to render the full table, without displaying the Concept Graph
  • 55 seconds to display everything, including the Concept Graph

@edeutsch (Collaborator, Author) commented

oof, thanks. Yeah, I think we should put some effort into slimming down the response first somehow. And then maybe something on the front end.

@amykglen (Member) commented

ok, per discussion with @edeutsch and others today, I've added an optional max_synonyms parameter to the NodeSynonymizer's get_normalizer_results() (in master), which you can use like this:

synonymizer.get_normalizer_results(entities="DOID:14330", max_synonyms=2)

and which produces a truncated cluster like the one below (I haven't shown the full knowledge_graph, but it is also truncated, to just the two retained nodes and the edges that connect them):

{
  "DOID:14330": {
    "id": {
      "identifier": "MONDO:0005180",
      "name": "Parkinson disease",
      "category": "biolink:Disease",
      "SRI_normalizer_name": "Parkinson disease",
      "SRI_normalizer_category": "biolink:Disease",
      "SRI_normalizer_curie": "MONDO:0005180"
    },
    "total_synonyms": 18,
    "categories": {
      "biolink:Disease": 18
    },
    "nodes": [
      {
        "identifier": "DOID:14330",
        "category": "biolink:Disease",
        "label": "Parkinson's disease",
        "major_branch": "DiseaseOrPhenotypicFeature",
        "in_sri": true,
        "name_sri": "Parkinson's disease",
        "category_sri": "biolink:Disease",
        "in_kg2pre": true,
        "name_kg2pre": "Parkinson's disease",
        "category_kg2pre": "biolink:Disease"
      },
      {
        "identifier": "MONDO:0005180",
        "category": "biolink:Disease",
        "label": "Parkinson disease",
        "major_branch": "DiseaseOrPhenotypicFeature",
        "in_sri": true,
        "name_sri": "Parkinson disease",
        "category_sri": "biolink:Disease",
        "in_kg2pre": true,
        "name_kg2pre": "Parkinson disease",
        "category_kg2pre": "biolink:Disease"
      }
    ],
    "knowledge_graph": {
      "nodes": {
        ...

so we were thinking the UI could decide how many nodes it's reasonable to display in one cluster (e.g., 200?), then call get_normalizer_results() with that number as max_synonyms. and maybe also provide a dropdown or the like that lets a user increase max_synonyms.

note that the top-level "categories" slot shown above, which reports node counts by category, still reflects the full (untruncated) cluster, and I also added a top-level "total_synonyms" slot to make it easy to report how many nodes are in the full cluster.
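
for illustration, a minimal sketch of how a caller might combine max_synonyms with the new total_synonyms slot - only get_normalizer_results() and its max_synonyms parameter are real here; the default cutoff and wrapper function are hypothetical:

# Minimal sketch, assuming the response shape shown above; the wrapper and
# the default cutoff are placeholders, not part of the synonymizer API.
DEFAULT_MAX_SYNONYMS = 200

def lookup_cluster(synonymizer, curie, max_synonyms=DEFAULT_MAX_SYNONYMS):
    results = synonymizer.get_normalizer_results(entities=curie, max_synonyms=max_synonyms)
    cluster = results[curie]
    if cluster["total_synonyms"] > max_synonyms:
        # the cluster was truncated; the UI could surface this and offer a
        # dropdown (or similar) to re-query with a larger max_synonyms
        print(f"Showing {max_synonyms} of {cluster['total_synonyms']} synonyms for {curie}")
    return cluster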

let me know if I can do anything else!
