Requires module networkx. Tested with version 1.9.1.
Requires module pickle.
Requires modules sys, time, and math, which should be installed with Python by default.
Requires module numpy.
See http://bib.oxfordjournals.org/content/13/5/569.full for metric definitions.
This function takes in the file name for a GO ontology file (obo format). The file must be of a specific format:
! comments
[Term]
id: GO_term
...
is_a: GO_term
is_a: GO_term
[Term]
id: GO_term
....
[Typedef]
...
Note: The [Typedef] tag signals the end of GO terms and is required (otherwise, the parser will fail to record the final GO term in the ontology file).
Please see example_go.obo for a full example file.
parse_go_file returns two objects as a tuple:
- go_graph: A networkx DiGraph object. go_graph represents the ontology as a DiGraph, where each is_a relationship is represented as an edge.
- alt_ids: GO ontology files provide alternate IDs for some terms (represented by alt_id: lines in the ontology file). alt_ids is a mapping from alternate IDs to the IDs stored in go_graph. alt_ids is a Python dictionary, where keys are GO IDs not in go_graph, and values are the corresponding GO IDs in go_graph.
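The file format described above can be read with a small line-by-line loop. The following is a minimal sketch, not the module's actual parser: the function name parse_obo_lines, the (child, parent) edge direction, and the placeholder IDs in the usage note are all assumptions made for illustration. It uses plain Python containers instead of a networkx DiGraph so that it runs without extra dependencies, and because it records each line as it is read, it does not rely on the trailing [Typedef] stanza the way the real parser does.

```python
def parse_obo_lines(lines):
    """Collect is_a edges and alt_id mappings from OBO-style lines.

    Hypothetical helper for illustration only; not part of semsimcalc.
    """
    edges = []      # (child, parent) pairs, one per is_a line (direction assumed)
    alt_ids = {}    # alternate ID -> primary ID
    current_id = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and "!" comments
        if line == "[Term]":
            in_term, current_id = True, None
        elif line.startswith("["):
            in_term = False  # any other stanza, e.g. [Typedef]
        elif in_term and line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif in_term and line.startswith("alt_id:"):
            alt_ids[line[len("alt_id:"):].strip()] = current_id
        elif in_term and line.startswith("is_a:"):
            # Drop any trailing "! name" comment after the parent ID
            parent = line[len("is_a:"):].split("!")[0].strip()
            edges.append((current_id, parent))
    return edges, alt_ids
```

For example, passing the lines of a two-term file in which the placeholder term GO:2 has an alt_id of GO:9 and an is_a line pointing at GO:1 yields one edge, (GO:2, GO:1), and the mapping GO:9 to GO:2. The edge list could then be handed to a DiGraph constructor if networkx is available.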
This function takes in a file name for a pre-processed annotation corpus file of a specific format:
-
protein_name
GO_term
GO_term
GO_term
-
...
-
Note: File must both start and end with a - line.
Please see example_corpus.stripped for a full example of a pre-processed annotation corpus file.
parse_annotation_corpus returns two objects in a tuple:
- prot_to_gos: This is a Python dictionary mapping protein names to GO terms. Keys are protein names, values are Python lists of GO terms associated with the key (from the annotation corpus).
- go_to_prots: This is a Python dictionary mapping GO terms to protein names. Keys are GO terms, and values are Python lists of protein names labeled with the key (from the annotation corpus).
If alt_ids is provided, then any keys in alt_ids that appear in the annotation corpus will be stored as their associated values in alt_ids.
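The corpus format and the alt_ids substitution can be illustrated with a short sketch. This is an assumption-laden stand-in rather than the module's own code: the name parse_corpus_lines and the placeholder protein and GO names are hypothetical, and the function takes an iterable of lines instead of a file name.

```python
def parse_corpus_lines(lines, alt_ids=None):
    """Build both protein/GO-term mappings from a stripped corpus.

    Hypothetical helper for illustration only; not part of semsimcalc.
    """
    alt_ids = alt_ids or {}
    prot_to_gos = {}   # protein name -> list of GO terms
    go_to_prots = {}   # GO term -> list of protein names
    protein = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "-":
            protein = None  # the next non-separator line names a protein
        elif protein is None:
            protein = line
            prot_to_gos.setdefault(protein, [])
        else:
            # Store an alternate ID under its primary ID when one is known
            term = alt_ids.get(line, line)
            prot_to_gos[protein].append(term)
            go_to_prots.setdefault(term, []).append(protein)
    return prot_to_gos, go_to_prots
```

With alt_ids mapping the placeholder GO:9 to GO:2, a record listing GO:9 under protein P1 is stored as GO:2 in both dictionaries.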
This function takes in a file path to a pickled SemSimCalculator object.
It returns an unpickled SemSimCalculator object.
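Loading a pickled object generally comes down to a single pickle.load call. The sketch below shows the round trip with a stand-in dictionary instead of a real SemSimCalculator; the function name load_pickled and the file name calc.pickle are assumptions for illustration.

```python
import os
import pickle
import tempfile

def load_pickled(filepath):
    """Unpickle and return whatever object was saved at filepath."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)

# Round trip with a stand-in object (a dict, not a real SemSimCalculator)
stand_in = {"num_proteins": 4}
path = os.path.join(tempfile.mkdtemp(), "calc.pickle")
with open(path, "wb") as handle:
    pickle.dump(stand_in, handle)
restored = load_pickled(path)
```

As with any use of pickle, only load files from a source you trust, since unpickling can execute arbitrary code.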
The SemSimCalculator class takes an ontology and an annotation corpus. It parses and uses these to calculate various semantic similarity metrics between terms, groups of terms, and proteins.
All class variables are technically public, but should be treated as private. Use the getter functions explained below to access them. Class variables:
- _go_graph: networkx DiGraph representing the GO ontology, as parsed/returned by parse_go_file
- _alt_list: Python dictionary representing alternate GO term IDs, as parsed/returned by parse_go_file
- _prot_to_gos: Python dictionary mapping protein names to their GO term labels, as parsed/returned by parse_annotation_corpus
- _go_to_prots: Python dictionary mapping GO terms to the proteins which they label, as parsed/returned by parse_annotation_corpus
- _proteins: Python list of protein names. Contains names of all proteins that have labels
- _num_proteins: integer. Size of _proteins
- _ic_vals: dictionary mapping GO term to its IC (information content) value. Initialized empty. Used for memoization
- _go_terms: list of all GO terms in the graph of the ontology
- _mica_store: reference to a MicaStore instance. Initialized as None, must be set manually
Creates new instance. Call semsimcalc.SemSimCalculator(file_name, file_name) to use. Will return a SemSimCalculator object.
Takes in file names for GO ontology file (obo format) and annotation corpus file (pre-processed file of the same format that parse_annotation_corpus takes, as explained above).
Initializes _go_graph, _alt_list, _prot_to_gos, _go_to_prots, _proteins, and _num_proteins.
Creates _ic_vals as an empty dictionary.
Saves a reference to a MicaStore instance.
Removes _mica_store reference (sets to None).
Pickles and saves self to filepath.
Return copy of _go_graph
Return copy of _alt_list
Return copy of _prot_to_gos
Return copy of _go_to_prots
Return copy of _ic_vals
Note: get_ic_vals does not inherently calculate IC values. Use precompute_ic_vals first if you need all IC values.
Return copy of _go_terms
Return reference to _mica_store
Takes in a GO term as a string. Calculates and returns the probability of that term or any of that term's descendants (in the GO DiGraph) occurring in the annotation corpus. That is:
[number of proteins labeled with term or a descendant of term] / [number of labeled proteins in annotation corpus]
Takes in a GO term as a string. Calculates and returns the information content of that term. Information content is defined (within this implementation) as:
-ln(prob(term))
Where prob(term) is the same as the result of calling calc_term_probability(term)
Note: Once an IC is calculated, it is stored in _ic_vals. Subsequent calls for the IC of the same term only look up the recorded value.
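The probability and information content definitions above can be worked through on a toy ontology. This sketch is only an illustration under stated assumptions: the term names, protein names, the parent-to-children dictionary, and the helper names are all made up, and plain dictionaries stand in for the networkx graph and the class's internal state.

```python
import math

# Toy data (hypothetical): parent -> children, and term -> annotated proteins
children = {"GO:root": ["GO:a", "GO:b"], "GO:a": ["GO:c"], "GO:b": [], "GO:c": []}
go_to_prots = {"GO:a": ["P1"], "GO:b": ["P2", "P3"], "GO:c": ["P4"]}
num_proteins = 4  # number of labeled proteins in the toy corpus

def descendants(term):
    """Return the set of all descendants of term (excluding term itself)."""
    found = set()
    stack = list(children.get(term, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children.get(node, []))
    return found

def term_probability(term):
    """Fraction of labeled proteins annotated with term or any descendant."""
    prots = set()
    for t in descendants(term) | {term}:
        prots.update(go_to_prots.get(t, []))
    return len(prots) / num_proteins

ic_cache = {}  # memoization table, playing the role of _ic_vals

def information_content(term):
    """-ln(prob(term)), computed once per term and then looked up."""
    if term not in ic_cache:
        ic_cache[term] = -math.log(term_probability(term))
    return ic_cache[term]
```

Here GO:a covers P1 directly and P4 through its descendant GO:c, so its probability is 2/4 and its IC is ln 2. A term with no annotated proteins at all has probability 0, for which the logarithm is undefined; this sketch would raise ValueError in that case.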
Fills the _ic_vals dictionary used for information content memoization.
Runs IC on all terms in the GO ontology.
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Maximum Informative Common Ancestor. (Returns a GO term as a string)
The MICA of two terms is the common ancestor of both terms with the highest information content value.
Note: For this implementation, if left and right are the same, they are included in the list of "common ancestors."
If a MicaStore instance is linked (through link_mica_store), MICA first queries the MicaStore instance. Only if the MicaStore instance does not return a GO term does MICA calculate a result from the GO graph and annotation corpus.
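The MICA lookup order described above (precomputed store first, graph second) can be sketched as follows. Everything here is hypothetical scaffolding: the toy terms and IC values are invented, a plain dictionary keyed by term pairs stands in for a MicaStore, and the sketch assumes each term always counts among its own ancestors, which is broader than the note above (it only states this for the case where left and right are identical).

```python
# Toy data (hypothetical): child -> parents, plus made-up IC values
parents = {"GO:root": [], "GO:a": ["GO:root"], "GO:b": ["GO:root"],
           "GO:c": ["GO:a"], "GO:d": ["GO:a", "GO:b"]}
ic = {"GO:root": 0.0, "GO:a": 0.7, "GO:b": 0.9, "GO:c": 1.4, "GO:d": 1.6}

def ancestors_and_self(term):
    """Return term together with every ancestor reachable from it."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def mica(left, right, store=None):
    """Most informative common ancestor, consulting an optional store first."""
    if store is not None:
        cached = store.get((left, right))
        if cached is not None:
            return cached  # the store answered, so skip the graph search
    common = ancestors_and_self(left) & ancestors_and_self(right)
    if not common:
        return None
    return max(common, key=lambda term: ic[term])
```

In this toy graph, GO:c and GO:d share the ancestors GO:a and GO:root, and GO:a has the higher IC, so it is returned regardless of argument order.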
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Resnik score of the two terms. (Returns a float)
simRes is defined as the information content of the MICA of two terms. See here for more details.
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Lin score for the two terms. (Returns a float)
simLin is defined as the simRes of two terms divided by the sum of the information contents for each term (left and right). See here for more details.
Takes in two GO terms as strings. Order doesn't matter. Calculates and returns the Jiang-Conrath score for the two terms. The Jiang-Conrath score is defined as:
1 - IC(left) + IC(right) - 2 * simRes(left, right)
See here for more details.
Note: Currently untested
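The three term-level scores can be written directly in terms of information content values. The sketch below follows the wording of the definitions above and is not taken from the module's source; the function names are hypothetical, and they take precomputed IC values rather than GO terms. For the Jiang-Conrath score it assumes the whole distance term, IC(left) + IC(right) - 2 * simRes(left, right), is subtracted from 1, which is how the similarity form is usually written.

```python
def sim_res(ic_mica):
    """Resnik score: the IC of the two terms' MICA."""
    return ic_mica

def sim_lin(ic_left, ic_right, ic_mica):
    """Lin score as worded above: simRes divided by the sum of both ICs."""
    return sim_res(ic_mica) / (ic_left + ic_right)

def sim_jc(ic_left, ic_right, ic_mica):
    """Jiang-Conrath score, assuming 1 minus the full distance term."""
    return 1 - (ic_left + ic_right - 2 * sim_res(ic_mica))
```

Note that sim_lin divides by zero when both terms have an IC of 0 (for example, when both are the root term), so a real implementation would need to decide how to handle that case.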
Example for proper function calls:
Assume calc is a SemSimCalculator instance:
calc.pairwise_average_term_comp(left_terms, right_terms, calc.simRes)
Will calculate and return the average of all pairwise Resnik scores for the given lists of GO terms, left_terms and right_terms.
Takes in two Python lists of GO terms (lefts, rights) and a comparison metric (e.g. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the average of all pairwise term comparisons, using metric.
Takes in two Python lists of GO terms (lefts, rights) and a comparison metric (e.g. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the max of all pairwise term comparisons, using metric.
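Both group-level comparisons reduce to scoring every (left, right) pair and then aggregating. The following is a minimal sketch of that idea with hypothetical function names and a toy metric; it is not the module's implementation.

```python
from itertools import product

def pairwise_scores(lefts, rights, metric):
    """Score every (left, right) combination with metric."""
    return [metric(left, right) for left, right in product(lefts, rights)]

def pairwise_average(lefts, rights, metric):
    """Average of all pairwise scores (fails on empty input lists)."""
    scores = pairwise_scores(lefts, rights, metric)
    return sum(scores) / len(scores)

def pairwise_max(lefts, rights, metric):
    """Largest of all pairwise scores (fails on empty input lists)."""
    return max(pairwise_scores(lefts, rights, metric))

def exact_match(left, right):
    """Toy metric for the example: 1.0 for identical terms, else 0.0."""
    return 1.0 if left == right else 0.0
```

With lefts = ["GO:a", "GO:b"] and rights = ["GO:a", "GO:c"], only one of the four pairs matches, so the average is 0.25 and the max is 1.0. Any callable that takes two terms and returns a number can be passed as metric.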
Example for proper function calls:
Assume calc is a SemSimCalculator instance:
calc.average_protein_comp(left_prot, right_prot, calc.simRes)
Will calculate and return the average of all pairwise Resnik scores for the GO terms associated with the two protein names, left_prot and right_prot.
Takes in two protein names as strings (left_prot, right_prot) and a reference to a comparison metric (e.g. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the average of all pairwise term comparisons for the GO terms associated with the proteins left_prot and right_prot using metric.
Takes in two protein names as strings (left_prot, right_prot) and a reference to a comparison metric (e.g. any function from the "Comparison Metrics" section). metric must take in two ontology terms and return a numeric score.
Returns the max of all pairwise term comparisons for the GO terms associated with the proteins left_prot and right_prot using metric.
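The protein-level comparisons add one step before the pairwise scoring: each protein name is first resolved to its list of GO terms. A short sketch of that step, using a hypothetical function name, a made-up prot_to_gos dictionary, and a toy metric:

```python
def protein_scores(left_prot, right_prot, prot_to_gos, metric):
    """Score every pair of GO terms annotated to the two proteins."""
    lefts = prot_to_gos[left_prot]    # KeyError if the protein is unlabeled
    rights = prot_to_gos[right_prot]
    return [metric(left, right) for left in lefts for right in rights]

# Toy annotations (hypothetical protein and term names)
prot_to_gos = {"P1": ["GO:a", "GO:b"], "P2": ["GO:b"]}

def exact_match(left, right):
    """Toy metric: 1.0 for identical terms, else 0.0."""
    return 1.0 if left == right else 0.0

scores = protein_scores("P1", "P2", prot_to_gos, exact_match)
average_score = sum(scores) / len(scores)
max_score = max(scores)
```

How the real class treats a protein that is missing from the annotation corpus is not stated here; this sketch simply lets the dictionary lookup fail.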
Wrapper class for a numpy matrix of MICA values.
Takes a numpy matrix of MICA values and an ordering of GO terms (indices in the matrix).
Class variables:
- _micas: numpy matrix of MICA values
- _go_to_index: dictionary mapping GO terms to indices in the matrix (matrix must be symmetrical)
Loads numpy matrix from matrix_filename into _micas.
Parses ordered list of GO terms from ordering_filename (one term per line).
Returns reference to numpy array _micas
Returns copy of _go_to_index dictionary
If term is in _go_to_index, return _go_to_index[term], which corresponds to term's index in _micas.
If term is not in _go_to_index, return None.
Attempts to look up a MICA value from _micas.
If MICA value cannot be found (or left or right are not in _go_to_index), returns None.
Standalone script to strip down a Swiss-Prot text file (".dat"). See http://www.uniprot.org/downloads for download location.
Only tested on Swiss-Prot currently.
To run:
python strip_ac.py -i [filename] -o [filename]
Where:
-i takes the file name for the original Swiss-Prot .dat file
-o takes the file name for the output (stripped) file.
The output of this script is compatible with semsimcalc.parse_annotation_corpus
These files are examples of format. example_corpus.dat corresponds with example_corpus.stripped, but not with example_go.obo.