Phenotype page: Display co-occurring phenotypes. #1538

jmcmurry · 2018-02-09T05:45:34Z

Not super high priority, but for discussion... I could have sworn there was already a ticket for this but ...
for phenotype pages, I think it would be interesting to have a tab with other phenotypes that frequently co-occur. The table would also contain a column for the number of diseases in which they co-occur.

kshefchek · 2018-02-15T22:26:53Z

See http://nbviewer.jupyter.org/github/monarch-initiative/monarch-analysis/blob/master/notebooks/phenotype-co-occurrence.ipynb for some back end ideas.

kshefchek · 2018-03-28T23:45:44Z

@jmcmurry is there a specific item in the R24 for this or is it more towards a general aim? I've done some work computing similarity coefficients, p-values, and mocked up some methods to include frequency classes from the HPO in the analysis. Some of this is in the notebook above and some is off line. I'm not sure if this is useful or overkill for what we'll get from this analysis.

jmcmurry · 2018-03-29T15:36:30Z

Hannah advises: Look at "Market basket analysis"

kshefchek · 2018-03-29T15:44:42Z

here is an update (rename as .html)
phenotype-co-occurrence.txt

kshefchek · 2018-03-29T16:21:18Z

The feedback I'm interested in:

Are the computations for normalization and p-values sound?
Is weighting on HPO frequency terms sound? It makes the code more complex and would force us to run and cache the computations ahead of time.
What types of visualizations would be useful?
What other datatypes should be included? For example, we could add the distance between terms using IC, which would be useful in searching for pleiotropy.

pnrobinson · 2018-03-29T19:45:08Z

I would say that co-occurrence is more of an FYI than a statistical test, and I am not sure that users would know what to do with the information about p-values. A more important question is how to deal with the implicit annotations. I would find it useful to know what overall categories are frequently shared, but at least on the website I am less convinced people would want to see long lists of shared terms.

kshefchek · 2018-03-29T23:48:58Z

Thanks, this is helpful! The p-value is more for establishing a cut-off in the case that it's not obvious how to interpret some normalization of the data, compared to a correlation coefficient where you might set the cutoff at abs(.7). But I see your point, and I thought this might be excessive.

EDIT: disregard original comment on implicit phenotype co-counts, I went about this incorrectly.

kshefchek · 2018-03-31T01:07:04Z

I've reworked the code for generating implicit co-occurrence data.

Code: It takes ~20 minutes and requires ~85g of memory. Would be interesting to see how this would perform in another language (e.g. Julia).
Output: https://data.monarchinitiative.org/analysis/co-occurrence/co-occur.tsv

The top co-occurring count is greater than the number of diseases with phenotype annotations. The alternative would be to convert all phenotypes and their closures into a single set per disease, but then we're missing co-occurrence on all implicit classes when two explicit terms share a common ancestor.

Next step would be to account for terms in the same lineage, or alternatively only consider terms with the same distance from the root class.

cmungall · 2018-03-31T02:05:26Z

hmm, I've managed to do this using ontobio on a laptop before

TomConlin · 2018-03-31T07:57:30Z

filtering redundant phenotypes as early as possible is key.
on the server in solr would be ideal because then you would not transmit them
but even making each list a set will do.
takes about 6 min on my machine with your code
and five minutes is loading the data,
so a rewrite in julia is likely to save !!!DOZENS!!! of seconds

note I have no permission to push to that repo but

-closure_list = [closures for closures in closure_map.values()]
+closure_list = [set(closures) for closures in closure_map.values()]

and you will be under 10G and 2 min processing
(loading stays the same)
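A minimal sketch of why the one-line change above matters, using toy data rather than the real closure map. The `closure_map` contents and the `count_pairs` helper are assumptions for illustration only, not the code in the notebook:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: disease -> inferred phenotype classes. Repeats occur
# because two explicit terms can share an ancestor (here HP:ROOT).
closure_map = {
    "DISEASE:1": ["HP:A", "HP:ROOT", "HP:B", "HP:ROOT"],
    "DISEASE:2": ["HP:A", "HP:ROOT", "HP:C", "HP:ROOT"],
}

def count_pairs(closure_list):
    """Count unordered pairs of distinct classes, summed over all diseases."""
    counts = Counter()
    for closures in closure_list:
        for a, b in combinations(sorted(closures), 2):
            if a != b:  # ignore a class paired with its own duplicate
                counts[(a, b)] += 1
    return counts

# Original form: duplicates are kept, so pairs are counted repeatedly.
as_lists = count_pairs([closures for closures in closure_map.values()])
# Suggested form: each class appears at most once per disease.
as_sets = count_pairs([set(closures) for closures in closure_map.values()])

print(as_lists[("HP:A", "HP:ROOT")])  # counted twice per disease
print(as_sets[("HP:A", "HP:ROOT")])   # counted once per disease
```

With lists, a pair's count can exceed the number of diseases; with sets it cannot, and there are fewer pairs to enumerate.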

pnrobinson · 2018-03-31T13:55:09Z

It seems odd that this requires such heavy computational resources. I had a prototype solution in Java that also arranged things according to category (but did not calculate p-values) that was pretty fast. I need to refactor it after having refactored everything else to use phenol, but certainly 85g/20 minutes are excessive. It would be good to collaborate more on code like this; why don't you take a look at HPO Workbench and see if that starts to fulfil the requirements?

kshefchek · 2018-03-31T17:04:27Z

@TomConlin I think it depends what we want out of the analysis. If a disease is annotated to 'abnormal optic nerve' and 'abnormal neuron', would I want to capture that 'abnormality of the nervous system' co-occurs with itself once in this disease? If we convert the implicit classes to a set we miss this. This is why the top count in the tsv is much higher than the total count of diseases.

@pnrobinson the code here looks at co-occurrence of every explicit and implicit class all the way up to HP:0000001 (which is unnecessary). If we were to look at just a subset of categorical phenotypes it would be far less resource hungry.

TomConlin · 2018-03-31T17:29:55Z

To capture what a disease is annotated to, we would have to distinguish the terms from all their included ancestors. Converting to a set means HP:0000118 shows up once per disease instead of ~25 times per disease. That is, you still get your disease associated with 'abnormality of the nervous system', but only once.

kshefchek · 2018-03-31T18:03:19Z

HP:0000118 isn't a great example because it doesn't make sense to capture, but say I have a disease annotated to 25 phenotypes that are all subclasses of 'nervous system abnormality', how many times does 'nervous sys abnormality' co-occur within that disease?

TomConlin · 2018-03-31T18:08:39Z

I am content with once.
I don't get a new grandparent through each of my cousins.

pnrobinson · 2018-03-31T18:08:43Z

With the inherited annotations, you need to count them only once per disease. That is, if a patient has abn of the brain and abn of the spinal cord, this would naively result in two inferred annotations for abn of the nervous system, but this is wrong, because according to the HPO model the annotation needs to be counted only once.
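The counting rule above can be sketched as follows. The toy ontology in `PARENTS` and both helper functions are hypothetical names made up for illustration; a real run would load the HPO graph instead:

```python
# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "abn_brain": ["abn_nervous_system"],
    "abn_spinal_cord": ["abn_nervous_system"],
    "abn_nervous_system": ["phenotypic_abnormality"],
    "phenotypic_abnormality": [],
}

def closure(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def inferred_annotations(explicit_terms):
    """Union the closures, so a shared ancestor is recorded only once."""
    result = set()
    for term in explicit_terms:
        result |= closure(term)
    return result

annotations = inferred_annotations(["abn_brain", "abn_spinal_cord"])
print(sorted(annotations))
```

Both explicit terms imply `abn_nervous_system`, but because the result is a set, the inferred annotation is present once rather than twice.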

kshefchek · 2018-03-31T18:10:36Z

Okay I was going about this wrong then!

TomConlin · 2018-03-31T18:21:01Z

It could be interesting to look at from the phenotype ancestor "score card" point of view.
but you still would not compute over them, just split out a count to be looked up later.

kshefchek · 2018-03-31T18:55:08Z

If it's interesting I will leave it as an option and compute it both ways. I understand everyone’s point that you can only be in one of two states of abnormality at the system level (present/absent). But say a patient presents with a mole on their arm, and an abscess on their thigh, would we not say they have two skin abnormalities occurring together?

pnrobinson · 2018-03-31T19:02:02Z

imho that is not the context of this approach-- we are not talking about what is happening in an individual patient, we are talking about whether any two diseases share an abnormality. I think it would be just confusing to double count in this way.

kshefchek · 2018-03-31T20:39:36Z

"we are talking about whether any two diseases share an abnormality"

It sounds like the way I'm calculating this is fundamentally wrong, as I'm looking at phenotypes occurring within the same disease.

pnrobinson · 2018-03-31T21:09:35Z

I see -- I would say not wrong but a different calculation. I was thinking that we take all diseases that have HPO:X and then ask what the most common co-occurring terms are. Possibly both calculations are interesting....
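A rough sketch of that second calculation: select the diseases annotated to a query term, then tally the other terms they carry. The `disease_to_phenotypes` data and the `co_occurring` function are invented for illustration, and each disease's annotations are assumed to already be a set:

```python
from collections import Counter

# Hypothetical toy annotations: disease -> set of phenotype terms.
disease_to_phenotypes = {
    "D1": {"HP:X", "HP:Y", "HP:Z"},
    "D2": {"HP:X", "HP:Y"},
    "D3": {"HP:Y", "HP:Z"},
}

def co_occurring(query, annotations):
    """Return (term, number of diseases) pairs for terms seen alongside `query`."""
    counts = Counter()
    for phenotypes in annotations.values():
        if query in phenotypes:
            counts.update(phenotypes - {query})
    return counts.most_common()

table = co_occurring("HP:X", disease_to_phenotypes)
for term, n_diseases in table:
    print(term, n_diseases)
```

The second column is the number of diseases in which the term co-occurs with the query, which is the column proposed for the phenotype page table at the top of this thread.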

cmungall · 2018-04-02T15:49:59Z

Here is what I think should be done.

This should be done as a standard enrichment test between two gene sets, i.e. a Fisher exact test for genes in P1 vs genes in P2, with appropriate correction for multiple tests. Skip tests if P1 and P2 are mutual ancestors/descendants.

Note this will give you a lot of significant matches between siblings and grandsiblings etc, so the appropriate background test is the set of all genes in the MRCAs of P1 and P2. The goal is to find latent connections not already in the ontology.

As far as implementation, I would avoid any direct computation in solr. Just load everything into main memory and do the calculations there with any necessary optimizations. The language is largely irrelevant, but note that ontobio has all the necessary calls to load into an association object any set of annotations in monarch, so the same analysis could be repeated for human PxP with genes, PxP with diseases, mouse PxP, PxP with orthologous genes (ie phenologs), PxGO, DxGO etc.
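One possible reading of the enrichment test described above, as a standard-library sketch. It does not use ontobio; the function names, the one-sided alternative, the Bonferroni correction, and the toy data are all assumptions for illustration. The `background` set is left to the caller (for example, the genes annotated to the MRCAs of P1 and P2):

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Probability of an overlap of at least `a`, given fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, col1)

def enrichment(p1, p2, term_to_entities, ancestors, background):
    """Return a p-value for P1 vs P2, or None if one is an ancestor of the other."""
    if p1 in ancestors.get(p2, set()) or p2 in ancestors.get(p1, set()):
        return None
    s1 = term_to_entities[p1] & background
    s2 = term_to_entities[p2] & background
    a = len(s1 & s2)               # annotated to both P1 and P2
    b = len(s1 - s2)               # P1 only
    c = len(s2 - s1)               # P2 only
    d = len(background) - a - b - c  # neither
    return fisher_exact_greater(a, b, c, d)

def bonferroni(p_value, n_tests):
    """Simple multiple-testing correction (one option among several)."""
    return min(1.0, p_value * n_tests)

# Hypothetical toy data: 10 background genes, two phenotypes sharing 3 genes.
background = {f"g{i}" for i in range(10)}
term_to_entities = {
    "P1": {"g0", "g1", "g2", "g3"},
    "P2": {"g0", "g1", "g2", "g4"},
    "ANC": set(background),
}
ancestors = {"P1": {"ANC"}, "P2": {"ANC"}}

p = enrichment("P1", "P2", term_to_entities, ancestors, background)
print(p, bonferroni(p, n_tests=20))
print(enrichment("P1", "ANC", term_to_entities, ancestors, background))  # skipped
```

In this sketch the 2x2 table holds the counts for both, P1 only, P2 only, and neither, all restricted to the chosen background. The same shape would apply if the entities were diseases rather than genes.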

kshefchek · 2018-04-02T15:56:28Z

@cmungall can you look at the notebook here https://github.com/monarch-initiative/monarch-app/issues/1538#issuecomment-377278609 and comment on whether I'm setting up the Fisher exact test correctly? I think we're on the same page but I'm not certain. In your example, does the intersection of diseases annotated to P1|P2 go in the 2x2 table?

jmcmurry added the R24 label Feb 9, 2018

jmcmurry assigned jmcmurry and lwinfree Mar 29, 2018

jmcmurry added the Analyze Phenotypes label Jun 4, 2018

jmcmurry added the UX and feedback label Jun 4, 2018

kshefchek unassigned lwinfree Jul 12, 2019
