darkspaceproject

Trying to find out how much is out there that needs curation...

Description and goals

This project aims to estimate how much published molecular interaction data has not yet been curated by molecular interaction databases, particularly IMEx-compliant ones. Several different strategies will be used for this purpose, and an integrated view of the estimate will be produced in the form of a table.

Strategies list

This is a list of the different approaches considered in order to obtain an estimate. Some of them will not provide potential interactions to curate, but they can be used to rank lists generated with other approaches.

List of external sources

  • Pathway inference:

    • Reactome inferred pairs:

      • Protein pairs are inferred from their association with a reaction as input, output or catalyst. The dataset is generated by the Reactome (www.reactome.org) team and provides associated PMIDs for each pair.
      • Data is not scored.
    • STRING pathway-inference:

      • STRING (https://string-db.org/) takes protein-protein associations from pathway databases such as KEGG or Reactome and infers that they represent protein-protein interactions.
      • Mining PMIDs from STRING data appears to be quite complicated, so we have ignored PMIDs for all datasets taken from STRING.
      • Associations are scored in every STRING-derived dataset.
    • OmniPath inferences:

      • OmniPath (http://omnipathdb.org/) is a comprehensive collection of literature-curated human signaling pathways.
      • We select data representing interactions (in the broadest sense) and also post-translational modifications, since the latter contain enzyme-substrate relationships.
      • Data is not scored, but it is associated with PMIDs when possible.
  • Text-mining approaches:

    • Text-mining EPMC: Collaboration with Senay Kafkas (EPMC).

      • Her pipeline outputs a list of UniProtKB accessions for pairs of genes/proteins that co-occur in the same sentence as a particular term (selected from a pre-made list). Only publications in which terms describing selected interaction detection methods are found are considered.
      • Interaction detection terms have been selected from the PSI-MI controlled vocabulary, filtering for those methods that are almost exclusively used to detect interactions (e.g. yeast two-hybrid or co-immunoprecipitation) and discarding those that have common broader applications (e.g. ELISA).
      • No score is provided, but the PMIDs where each pair was found are indicated, along with the number of times the pair and the detection method term were found.
      • The search is done through the full text when the paper is open access (PMC), or through the abstract only otherwise. Around 800,000 publications are scanned.
      • A success rate of about 60% was observed regarding identification of publications effectively containing experimental interaction data (ignoring protein pair identification).
    • EVEX (http://evexdb.org/):

      • EVEX is a text-mining resource that aims to identify interactions of different types from the literature, segment those interactions by type, and measure the confidence that each interaction is really described in the article (and is not an artifact of text-mining).
      • An interaction is identified by a pair of genes (in the network format; EVEX offers another data format that is not relevant for us), segmented by type and polarity, and given a confidence score. PMIDs for each association are provided.
    • STRING text-mined data:

      • STRING (https://string-db.org/) uses text-mining as one of the many sources of the protein-protein interaction data it provides. It derives associations simply by finding co-occurrences of any two proteins in an abstract. Only associations also found using other approaches are taken into account.
      • As stated above, no PMIDs are provided for STRING-derived datasets, but pairs are scored.
  • Predicted interactions:

    • IID-predicted: The Integrated Interactions Database (IID) is a predictive meta-database built in Igor Jurisica's lab (http://dcv.uhnres.utoronto.ca/iid/).

      • It comprises data obtained from primary databases (IMEx consortium, BioGRID, HPRD...) plus computational predictions.
      • We use the subset of data that has been produced by computational prediction and orthologous similarity.
    • STRING phylogeny-predicted data:

      • STRING (https://string-db.org/) predicts protein interactions using inference from experimental evidence provided by interacting pairs of phylogenetically-related organisms.
      • As stated above, no PMIDs are provided for STRING-derived datasets, but pairs are scored.
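
The sentence-level co-occurrence idea described for the EPMC text-mining strategy can be illustrated with a minimal Python sketch. This is not Senay Kafkas' pipeline: the method term list, the protein name dictionary and the naive sentence splitter are hypothetical placeholders chosen for illustration, and real entity recognition and mapping to UniProtKB accessions are considerably more involved.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical examples of interaction detection method terms;
# not the project's actual PSI-MI-derived list.
METHOD_TERMS = {"yeast two-hybrid", "co-immunoprecipitation"}

# Hypothetical name -> UniProtKB accession dictionary, for illustration only.
PROTEIN_NAMES = {"TP53": "P04637", "MDM2": "Q00987", "EP300": "Q09472"}


def cooccurring_pairs(text):
    """Count protein pairs mentioned in the same sentence as a method term."""
    counts = Counter()
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        if not any(term in lowered for term in METHOD_TERMS):
            continue  # no selected detection method mentioned in this sentence
        found = sorted(
            acc for name, acc in PROTEIN_NAMES.items()
            if re.search(rf"\b{re.escape(name)}\b", sentence)
        )
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


example = (
    "MDM2 binds TP53 as shown by co-immunoprecipitation. "
    "EP300 levels were measured by ELISA together with TP53."
)
print(cooccurring_pairs(example))
```

In this toy example only the first sentence contributes a pair, because ELISA is deliberately left out of the method term list, mirroring the filtering of methods with broader applications described above.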

Interaction reference sets

These resources contain actual protein interaction data and can be used as a reference to estimate whether the external sources can actually provide molecular interaction data.

  • IMEx dataset:

    • Our own molecular interaction data, blending all IMEx consortium databases curating into IntAct, plus DIP (soon to be updated, once the DIP import is complete).
  • BioGRID data (https://thebiogrid.org/):

    • We need to identify how much information is curated in BioGRID as well, since any predicted interaction will be given lower priority if it is already curated there.
    • Since BioGRID maps its proteins to EntrezGeneIDs, we take its data from mentha, which has already translated them into UniProt accessions.
  • GO IPIs:

    • 'Inferred from Protein Interaction' annotations made by GO curators. Available through a new PSICQUIC server (EBI-GOA-nonIntAct).
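
As an illustration of how external sources can be checked against these reference sets, here is a minimal Python sketch. The dataset names, the placeholder accessions and the helper functions are all assumptions made for this example and are not taken from the project's code. It treats pairs as unordered, records which datasets report each pair, and keeps the pairs that no reference set contains.

```python
def pair_key(a, b):
    """Return an order-independent key so that A-B and B-A are the same pair."""
    return tuple(sorted((a, b)))


def presence_table(sources):
    """Map every pair to the set of dataset names in which it appears."""
    table = {}
    for name, pairs in sources.items():
        for a, b in pairs:
            table.setdefault(pair_key(a, b), set()).add(name)
    return table


def uncurated(table, reference_names):
    """Pairs reported by at least one source but by none of the reference sets."""
    return {
        pair: names
        for pair, names in table.items()
        if not names & reference_names
    }


# Toy data with made-up placeholder accessions (hypothetical).
datasets = {
    "imex": [("P1", "P2")],
    "biogrid": [("P3", "P2")],
    "reactome": [("P2", "P1"), ("P4", "P5")],
    "iid_pred": [("P5", "P4"), ("P2", "P3")],
}
table = presence_table(datasets)
dark = uncurated(table, {"imex", "biogrid", "go_ipi"})
for pair, names in sorted(dark.items()):
    print(pair, sorted(names))
```

Sorting the two identifiers before using them as a key is a simple way to avoid counting the same pair twice when sources list the partners in a different order.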

Strategies considered but not implemented

  • Laitor/PESCADOR/MedLine Ranker: (Miguel Andrade and Adriano Barbosa) Text mining tools that allow ranking of the term co-occurrences.
  • UniProtKB CC subunit lines: They detail interactions within complexes that might not be curated by IMEx.
  • Structure predictions using LAMs (Local Approximate Models)
  • Genetic interactions: Although the overlap between protein and genetic interactions is negligible, it might help identify those predicted interactions that have higher biological interest.
  • Negatome (http://mips.helmholtz-muenchen.de/proj/ppi/negatome/), a database of verified negative interactions, can also be used to filter out spurious associations.

Technical reports

To read specific information about how each dataset has been handled and integrated, please use the following links:

Ranking pipeline evaluation

This project has been complemented with a ranking pipeline based on a random forest algorithm implementation and the definition of two different scoring systems developed by Miguel Vázquez. Details about the ranking pipeline can be found here: https://github.com/Rbbt-Workflows/DarkSpace.

The two scores used were defined as follows:

  • Relevance: ranks publications by likelihood of containing interactions; uses text-mining (TM) dataset scores and a random forest.
  • Interest: ranks by representation of proteins in interaction datasets; uses weighted frequencies of protein occurrence.

Detailed evaluations of the pipeline performance were done and can be found under https://github.com/Rbbt-Workflows/DarkSpace/tree/master/manual_evaluation. Five rounds of manual evaluation were performed, concluding that the pipeline does a reasonable job of sorting publications by likelihood of containing interaction data (relevance score), but a poor one at identifying relevant pairs of interacting proteins (interest score). The latest reports can be downloaded in HTML format from the folder mentioned above. Rounds 4 and 5 evaluate identical versions of the pipeline run in two separate instances, and a report on stochastic effects on the scores is available at https://github.com/Rbbt-Workflows/DarkSpace/blob/master/manual_evaluation/comp_rel_rd4_rd5.html.

The results of this pipeline are then manually uploaded to Google Docs for IMEx curators to use. Details about the comparison and the production of the Google file can be found in https://github.com/pporrasebi/darkspaceproject/blob/master/dsp_comparison/dsp_comparison_final_v3.md.

Manual evaluations

This is the link to the Google Spreadsheet for all manual evaluations: https://docs.google.com/spreadsheets/d/1tL1HtVD3-BxHxKuXbIYhcFjmCptGVEOD5aFJCZw6CZk/edit#gid=343089856. It has the following sheets:

  • dsp_priority_updated: The most up to date sheet to be used as input for manual curation. Features a number of columns with information about data sources, MeSH terms annotation for each publication and relevance and interest scores, plus predicted interacting pairs.
  • All_PMIDs_checked: Full list of all PubMed IDs manually evaluated for presence of curatable interaction data.
  • DSP manual requests: Manually updated list of specific publication requests from users and colleagues.
  • MeSH terms list: Full list of MeSH terms used to annotate the dsp_priority_updated list.
  • Low-hanging fruit: List of publications and protein pairs absent from IMEx databases, but present in Reactome and detected by both IID predictions and EPMC text-mining pipeline. Produced for the first version of the dsp_comparison sheet.
  • Eval_Reactome: List of 100 random PMIDs found in Reactome for manual evaluation.
  • Eval_TM_EPMC, Eval_TM_EPMC_it2, Eval_TM_EPMC_it3, Eval_TM_EPMC_it4: Series of sheets with manual evaluation of PMIDs containing interaction data as derived from the EPMC text mining pipeline. Each sheet corresponds to a further iteration of the pipeline, it4 being the best and final one.
  • Proteoglycan interactions: Manually defined list of publications, proteins and glycans potentially containing proteoglycan interactions, produced by Sylvie Ricard-Blum.
  • dsp_priority, dsp_priority_rev, dsp_priority_rev_pred, dsp_priority_rev_noalzh, dsp_biogrid_go: These are all hidden tables and can only be seen if unhidden. They are all previous versions of dsp_priority_updated, featuring a varying number of columns. The dsp_priority_rev_noalzh is a list where Alzheimer's Disease-related publications, as defined by MeSH term annotation, have been removed for a specific project, now already finalised.
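
The selection rule behind the "Low-hanging fruit" sheet (absent from IMEx, present in Reactome, and detected by both IID predictions and EPMC text-mining) can be expressed with plain set operations. The Python sketch below is an illustration only: the function names and placeholder accessions are hypothetical, and it covers protein pairs but not the associated publications.

```python
def normalise(pairs):
    """Turn (a, b) tuples into unordered pairs so that partner order is ignored."""
    return {frozenset(pair) for pair in pairs}


def low_hanging_fruit(imex, reactome, iid_predicted, epmc_tm):
    """Pairs supported by all three external sources and missing from IMEx."""
    supported = (
        normalise(reactome) & normalise(iid_predicted) & normalise(epmc_tm)
    )
    return supported - normalise(imex)


# Toy data with made-up placeholder accessions (hypothetical).
imex = [("P1", "P2")]
reactome = [("P2", "P1"), ("P3", "P4"), ("P5", "P6")]
iid_predicted = [("P4", "P3"), ("P1", "P2")]
epmc_tm = [("P3", "P4"), ("P5", "P6")]

result = low_hanging_fruit(imex, reactome, iid_predicted, epmc_tm)
print(sorted(sorted(pair) for pair in result))
```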

Pending tasks

  • The results from the last iteration of the evaluation pipeline, produced in January 2020, were never implemented in the dsp_priority_updated sheet. It is not critical to do this, given the minor differences in the ranking pipeline, but it might be advisable in the future.
  • IMExCentral features a full list of publications checked for curatable information and found to be negative. This could potentially be added to the tables in the project and used for training new versions of the algorithm. Also, Lukasz and other members of DIP went through parts of the dsp_priority_updated list and recorded their evaluations in IMExCentral, so these need to be brought in as well.
  • A variation of the ranking pipeline with updated scorings has been discussed with Miguel Vázquez, but never pursued.
