This repository has been archived by the owner on Apr 12, 2023. It is now read-only.

Phenotype page: Display co-occurring phenotypes. #1538

Open
jmcmurry opened this issue Feb 9, 2018 · 24 comments
@jmcmurry
Member

jmcmurry commented Feb 9, 2018

Not super high priority, but for discussion... I could have sworn there was already a ticket for this but ...
for phenotype pages, I think it would be interesting to have a tab with other phenotypes that frequently co-occur. The table would also contain a column for the number of diseases in which they co-occur.
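
A minimal sketch of how such a table could be computed, assuming a plain mapping from disease to its set of phenotype IDs. The IDs, the `disease_to_phenotypes` data, and both function names are hypothetical and only illustrate the counting:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: disease -> set of phenotype IDs.
disease_to_phenotypes = {
    "DISEASE:1": {"HP:A", "HP:B", "HP:C"},
    "DISEASE:2": {"HP:A", "HP:B"},
    "DISEASE:3": {"HP:B", "HP:C"},
}


def co_occurrence_counts(annotations):
    """Count, for each unordered phenotype pair, the diseases annotated to both."""
    counts = Counter()
    for phenotypes in annotations.values():
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for pair in combinations(sorted(phenotypes), 2):
            counts[pair] += 1
    return counts


def co_occurring_with(phenotype, counts):
    """Rows for the tab: (other phenotype, number of diseases), most frequent first."""
    rows = []
    for (a, b), n in counts.items():
        if phenotype == a:
            rows.append((b, n))
        elif phenotype == b:
            rows.append((a, n))
    return sorted(rows, key=lambda row: (-row[1], row[0]))


counts = co_occurrence_counts(disease_to_phenotypes)
table_for_b = co_occurring_with("HP:B", counts)
```

Because each disease contributes a set, a pair is counted at most once per disease, so the count column reads directly as "number of diseases in which they co-occur".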

@jmcmurry jmcmurry added the R24 label Feb 9, 2018
@kshefchek
Contributor

@jmcmurry is there a specific item in the R24 for this or is it more towards a general aim? I've done some work computing similarity coefficients, p-values, and mocked up some methods to include frequency classes from the HPO in the analysis. Some of this is in the notebook above and some is off line. I'm not sure if this is useful or overkill for what we'll get from this analysis.

@jmcmurry
Member Author

Hannah advises: Look at "Market basket analysis"
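
For reference, the three standard market basket measures (support, confidence, lift) could be sketched as follows, treating each disease as a "basket" of phenotypes. The `baskets` data and the phenotype IDs are hypothetical:

```python
# Hypothetical toy data: each disease is a "basket" of phenotype IDs.
baskets = [
    {"HP:A", "HP:B"},
    {"HP:A", "HP:B", "HP:C"},
    {"HP:A"},
    {"HP:C"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent)."""
    return support({antecedent, consequent}, baskets) / support({antecedent}, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline frequency.

    Values above 1 suggest the two phenotypes co-occur more often than
    expected if they were independent.
    """
    return confidence(antecedent, consequent, baskets) / support({consequent}, baskets)


support_ab = support({"HP:A", "HP:B"}, baskets)
lift_ab = lift("HP:A", "HP:B", baskets)
```

Lift is symmetric in the two phenotypes, which makes it a plausible candidate for ranking pairs. Confidence is directional and answers "given A, how often B?".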

@kshefchek
Contributor

here is an update (rename as .html)
phenotype-co-occurrence.txt

@kshefchek
Contributor

kshefchek commented Mar 29, 2018

The feedback I'm interested in:

  • Are the computations for normalization and p-values sound?
  • Is weighting on HPO frequency terms sound? This makes the code more complex and would force us to run and cache the computations ahead of time.
  • What types of visualizations would be useful?
  • What other datatypes should be included? For example, we could add distance between terms using IC; this would be useful in searching for pleiotropy.
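
As a rough illustration of the last point, one common IC-based measure is Resnik similarity: the information content of the most informative common ancestor of two terms. The sketch below assumes IC is estimated from disease annotation frequency. The toy ontology, the annotations, and all names are hypothetical:

```python
import math

# Hypothetical toy ontology: term -> direct parents.
parents = {
    "ROOT": set(),
    "NERVOUS": {"ROOT"},
    "SKIN": {"ROOT"},
    "BRAIN": {"NERVOUS"},
    "SPINAL_CORD": {"NERVOUS"},
}

# Hypothetical explicit annotations: disease -> terms.
annotations = {
    "D1": {"BRAIN"},
    "D2": {"SPINAL_CORD"},
    "D3": {"SKIN"},
    "D4": {"BRAIN", "SKIN"},
}


def closure(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """-log of the fraction of diseases annotated to the term or a descendant.

    Assumes every term has at least one annotated disease, as in the toy data.
    """
    n = sum(
        any(term in closure(t) for t in terms) for terms in annotations.values()
    )
    return -math.log(n / len(annotations))


def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(information_content(t) for t in closure(a) & closure(b))


sim_siblings = resnik("BRAIN", "SPINAL_CORD")
sim_unrelated = resnik("BRAIN", "SKIN")
```

Pairs that co-occur often but have low IC-based similarity (they only share the root) would be the candidates of interest for pleiotropy.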

@pnrobinson
Member

I would say that co-occurrence is more of an FYI than a statistical test, and I am not sure that users would know what to do with the information about p-values. A more important question is how to deal with the implicit annotations. I would find it useful to know what overall categories are frequently shared, but at least on the website I am less convinced people would want to see long lists of shared terms.

@kshefchek
Contributor

kshefchek commented Mar 29, 2018

Thanks, this is helpful! The p-value is more for establishing a cut-off in cases where it's not obvious how to interpret some normalization of the data, compared to a correlation coefficient where you might set the cutoff at abs(.7). But I see your point, and I thought this might be excessive.

EDIT: disregard original comment on implicit phenotype co-counts, I went about this incorrectly.

@kshefchek
Contributor

I've reworked the code for generating implicit co-occurrence data.

Code: It takes ~20 minutes and requires ~85g of memory. Would be interesting to see how this would perform in another language (e.g. Julia).
Output: https://data.monarchinitiative.org/analysis/co-occurrence/co-occur.tsv

The top co-occurring count is greater than the number of diseases with phenotype annotations. The alternative would be to convert all phenotypes and their closures into a single set per disease, but then we're missing co-occurrence on all implicit classes when two explicit terms share a common ancestor.

Next step would be to account for terms in the same lineage, or alternatively only consider terms with the same distance from the root class.

@cmungall
Member

hmm, I've managed to do this using ontobio on a laptop before

@TomConlin

TomConlin commented Mar 31, 2018

filtering redundant phenotypes as early as possible is key.
on the server in solr would be ideal because then you would not transmit them
but even making each list a set will do.
takes about 6 min on my machine with your code
and five minutes is loading the data,
so a rewrite in julia is likely to save !!!DOZENS!!! of seconds

note I have no permission to push to that repo but

-closure_list = [closures for closures in closure_map.values()]
+closure_list = [set(closures) for closures in closure_map.values()]

and you will be under 10G and 2 min processing
(loading stays the same)
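
A toy illustration of why the one-line change matters, assuming the script enumerates pairs over each disease's closure list (the script itself is not shown here, so the `closures` data is hypothetical):

```python
from itertools import combinations

# Hypothetical closure list for one disease annotated to two sibling terms:
# the shared ancestors appear once per explicit term.
closures = ["BRAIN", "NERVOUS", "ROOT", "SPINAL_CORD", "NERVOUS", "ROOT"]

# Pairs over the raw list include duplicates and self-pairs.
pairs_from_list = list(combinations(closures, 2))

# Pairs over the de-duplicated set: each term pair appears exactly once.
pairs_from_set = list(combinations(sorted(set(closures)), 2))
```

Here the list yields 15 pairs, including `("NERVOUS", "NERVOUS")`, while the set yields 6. The number of pairs grows quadratically with list length, so removing repeated ancestors shrinks memory and run time sharply.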

@pnrobinson
Member

It seems odd that this requires such heavy computational resources. I had a prototype solution in Java that also arranged things according to category (but did not calculate p-values) that was pretty fast. I need to refactor it after having refactored everything else to use phenol, but certainly 85g/20 minutes are excessive. It would be good to collaborate more on code like this; why don't you take a look at HPO Workbench and see if that starts to fulfil the requirements?

@kshefchek
Contributor

kshefchek commented Mar 31, 2018

@TomConlin I think it depends what we want out of the analysis. If a disease is annotated to 'abnormal optic nerve' and 'abnormal neuron', would I want to capture that 'abnormality of the nervous system' co-occurs with itself once in this disease? If we convert the implicit classes to a set we miss this. This is why the top count in the tsv is much higher than the total count of diseases.

@pnrobinson the code here looks at co-occurrence of every explicit and implicit class all the way up to HP:0000001 (which is unnecessary). If we were to look at just a subset of categorical phenotypes it would be far less resource hungry.

@TomConlin

TomConlin commented Mar 31, 2018

To capture what a disease is annotated to, we would have to distinguish the terms from all their included ancestors. Converting to a set means HP:0000118 shows up once per disease instead of ~25 times per disease. That is, you still get your disease associated with 'abnormality of the nervous system', but only once.

@kshefchek
Contributor

HP:0000118 isn't a great example because it doesn't make sense to capture, but say I have a disease annotated to 25 phenotypes that are all subclasses of 'nervous system abnormality', how many times does 'nervous sys abnormality' co-occur within that disease?

@TomConlin

I am content with once.
I don't get a new grandparent through each of my cousins.

@pnrobinson
Member

With the inherited annotations, you need to count them only once per disease. That is, if a patient has abn of the brain and abn of the spinal cord, this would naively result in two inferred annotations for abn of the nervous system, but this is wrong, because according to the HPO model the annotation needs to be counted only once.
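
The once-per-disease rule can be sketched by taking the union of the closures of a disease's explicit terms before counting. This is only an illustration on a hypothetical toy ontology. The term names, `parents`, and `annotations` are made up:

```python
from collections import Counter

# Hypothetical toy ontology: term -> direct parents.
parents = {
    "ROOT": set(),
    "NERVOUS": {"ROOT"},
    "BRAIN": {"NERVOUS"},
    "SPINAL_CORD": {"NERVOUS"},
}


def closure(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagated(explicit_terms):
    """Union of closures, so each inferred term appears at most once."""
    result = set()
    for term in explicit_terms:
        result |= closure(term)
    return result


# Hypothetical annotations: disease -> explicit terms.
annotations = {
    "D1": {"BRAIN", "SPINAL_CORD"},
    "D2": {"BRAIN"},
}

# Number of diseases annotated (explicitly or by inference) to each term.
diseases_per_term = Counter(
    term for terms in annotations.values() for term in propagated(terms)
)
```

With this counting, no term can have a count larger than the number of diseases, which is a quick sanity check on any output file.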

@kshefchek
Contributor

Okay I was going about this wrong then!

@TomConlin

It could be interesting to look at from the phenotype ancestor "score card" point of view.
but you still would not compute over them, just split out a count to be looked up later.

@kshefchek
Contributor

If it's interesting I will leave it as an option and compute it both ways. I understand everyone’s point that you can only be in one of two states of abnormality at the system level (present/absent). But say a patient presents with a mole on their arm, and an abscess on their thigh, would we not say they have two skin abnormalities occurring together?

@pnrobinson
Member

imho that is not the context of this approach-- we are not talking about what is happening in an individual patient, we are talking about whether any two diseases share an abnormality. I think it would be just confusing to double count in this way.

@kshefchek
Contributor

> we are talking about whether any two diseases share an abnormality

It sounds like the way I'm calculating this is fundamentally wrong, as I'm looking at phenotypes occurring within the same disease.

@pnrobinson
Member

I see -- I would say not wrong but a different calculation. I was thinking that we take all diseases that have HPO:X and then ask what the most common co-occurring terms are. Possibly both calculations are interesting....

@cmungall
Member

cmungall commented Apr 2, 2018

Here is what I think should be done.

This should be done as a standard enrichment test between two gene sets, i.e. a Fisher exact test for genes in P1 vs genes in P2, with appropriate correction for multiple tests. Skip tests if P1 and P2 are mutual ancestors/descendants.

Note this will give you a lot of significant matches between siblings and grandsiblings etc, so the appropriate background test is the set of all genes in the MRCAs of P1 and P2. The goal is to find latent connections not already in the ontology.

As far as implementation, I would avoid any direct computation in solr. Just load everything into main memory and do the calculations there with any necessary optimizations. The language is largely irrelevant, but note that ontobio has all the necessary calls to load into an association object any set of annotations in monarch, so the same analysis could be repeated for human PxP with genes, PxP with diseases, mouse PxP, PxP with orthologous genes (ie phenologs), PxGO, DxGO etc.
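
A minimal, standard-library-only sketch of the test described above, under several assumptions: the one-sided (over-representation) form of the Fisher exact test, Bonferroni as the multiple-testing correction, and a single shared `background` gene set. To follow the MRCA suggestion, `background` would instead be the genes annotated to the MRCA of each pair. All data and names below are hypothetical, and ontobio is not used here:

```python
from itertools import combinations
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for over-representation in a 2x2 table.

    a: genes in both P1 and P2      b: genes in P1 only
    c: genes in P2 only             d: genes in neither
    """
    n = a + b + c + d
    row1 = a + b  # genes in P1
    col1 = a + c  # genes in P2
    # Hypergeometric tail: P(overlap >= a) with the margins held fixed.
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, col1)


def pair_p_value(genes1, genes2, background):
    """Build the 2x2 table for two gene sets and return the p-value."""
    g1, g2 = genes1 & background, genes2 & background
    a = len(g1 & g2)
    b = len(g1 - g2)
    c = len(g2 - g1)
    d = len(background) - a - b - c
    return fisher_exact_greater(a, b, c, d)


# Hypothetical toy inputs.
background = {f"g{i}" for i in range(1, 11)}
genes_by_phenotype = {
    "P1": {"g1", "g2", "g3", "g4"},
    "P2": {"g1", "g2", "g3", "g5"},
    "P1_CHILD": {"g1", "g2"},
}
# Hypothetical closure: term -> itself plus its ancestors.
closure = {"P1": {"P1"}, "P2": {"P2"}, "P1_CHILD": {"P1_CHILD", "P1"}}


def related(t1, t2):
    """True if one term is an ancestor of (or identical to) the other."""
    return t1 in closure[t2] or t2 in closure[t1]


# Skip ancestor/descendant pairs, then apply a Bonferroni correction.
tests = [
    (t1, t2)
    for t1, t2 in combinations(sorted(genes_by_phenotype), 2)
    if not related(t1, t2)
]
results = {
    (t1, t2): min(
        1.0,
        pair_p_value(genes_by_phenotype[t1], genes_by_phenotype[t2], background)
        * len(tests),
    )
    for t1, t2 in tests
}
```

Swapping the gene sets for disease sets gives the PxP-with-diseases variant. Bonferroni is conservative, and a false discovery rate procedure may be preferable at the scale of all phenotype pairs.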

@kshefchek
Contributor

@cmungall can you look at the notebook here https://github.com/monarch-initiative/monarch-app/issues/1538#issuecomment-377278609 and comment if I'm setting up the fisher exact test correctly? I think we're on the same page but not certain. In your example does the intersection of diseases annotated to P1|P2 go in the 2x2 table?

6 participants