to arise from evolutionary conservation, such as that between homologues. The simplest
form of protein function inference from the sequence of a protein is to search
for likely homologues using BLAST and examine the annotations of these homologous
proteins. Building on this principle, an array of tools has been developed that extend
it in different directions and to different degrees to improve functional inference from
sequence homology.
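The basic procedure can be sketched in a few lines. The following is a minimal illustration, not a tool described in this section: it assumes the BLAST results are available in the standard 12-column tabular format, and the function names, the E-value cutoff of 1e-5, the protein identifiers, and the annotations are all hypothetical choices made for the example.

```python
# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_blast_tabular(text):
    """Yield (subject_id, e_value) pairs from BLAST tabular output."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = dict(zip(BLAST_COLUMNS, line.split("\t")))
        yield fields["sseqid"], float(fields["evalue"])


def transfer_annotations(blast_text, annotations, max_evalue=1e-5):
    """Collect annotations of hits whose E-value is at most max_evalue.

    Returns {annotation: smallest E-value among the hits supporting it}.
    The cutoff is an assumption of this sketch, not a recommended value.
    """
    inferred = {}
    for subject, evalue in parse_blast_tabular(blast_text):
        if evalue > max_evalue:
            continue
        for term in annotations.get(subject, ()):
            inferred[term] = min(evalue, inferred.get(term, evalue))
    return inferred


# Hypothetical hits and annotations, for illustration only.
EXAMPLE_HITS = "\n".join([
    "query1\tprotA\t92.0\t300\t24\t0\t1\t300\t1\t300\t1e-150\t520",
    "query1\tprotB\t45.0\t280\t150\t4\t5\t284\t10\t289\t1e-30\t130",
    "query1\tprotC\t28.0\t90\t60\t5\t100\t189\t40\t129\t0.5\t30",
])
EXAMPLE_ANNOTATIONS = {
    "protA": ["kinase activity"],
    "protB": ["kinase activity", "ATP binding"],
    "protC": ["DNA binding"],
}

if __name__ == "__main__":
    print(transfer_annotations(EXAMPLE_HITS, EXAMPLE_ANNOTATIONS))
```

In this toy input the third hit is discarded because its E-value exceeds the cutoff, so only the annotations of the two significant hits are transferred.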
9.2.3.2 Automated function prediction from homologues
GoFigure [2] is one of the earliest among such tools. Given the sequence of a protein,
GoFigure first performs a homology search using BLAST to find GO-annotated proteins
with similar sequences. The sub-graph of the GO DAG with the greatest depth from the root
that includes all GO terms assigned to these proteins is then identified. This graph is termed
the minimum covering graph (MCG). Each term in this MCG is then assigned a weighted
score derived from alignments to proteins annotated with the term. The contribution of
an alignment is inversely related to its E-value. The score for each term in the MCG
is then normalized by dividing it by that of the root. Terms with a normalized score of
0.2 or greater are then reported as inferred annotations for the query protein. GoFigure
provides a systematic way of assigning weighted GO terms to a query protein based on
homology search, giving higher weight to terms annotated to more proteins in the search
results, as well as terms associated with more significant alignments. GoFigure was initially
available at http://udgenome.ags.udel.edu/frm_go.html/, but is no longer available at the
time of writing.
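The scoring scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the GoFigure implementation: the text says only that the contribution of an alignment is inversely related to its E-value, so the weight used here (the negative base-10 logarithm of the E-value, clamped at zero) is an assumed choice, as are the toy DAG, the term names, and the function names. Each hit is assumed to contribute once to every term it is annotated with and to each ancestor of those terms.

```python
import math

# Toy GO-like DAG: child -> list of parents (hypothetical term names).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic activity": ["root"],
    "ATP binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
}


def with_ancestors(terms, parents):
    """Return the given terms plus all of their ancestors in the DAG."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents[term])
    return seen


def alignment_weight(evalue):
    """Assumed weight: larger for smaller E-values, never negative."""
    if evalue <= 0.0:
        return 300.0  # arbitrary cap for E-values reported as 0 (assumption)
    return max(-math.log10(evalue), 0.0)


def score_terms(hits, parents, root="root", threshold=0.2):
    """Score terms from (e_value, annotated_terms) hits.

    Scores are summed over hits, normalized by the root score, and only
    terms at or above the threshold are returned.
    """
    scores = {}
    for evalue, terms in hits:
        weight = alignment_weight(evalue)
        for term in with_ancestors(terms, parents):
            scores[term] = scores.get(term, 0.0) + weight
    root_score = scores.get(root, 0.0)
    if root_score == 0.0:
        return {}
    normalized = {t: s / root_score for t, s in scores.items()}
    return {t: v for t, v in normalized.items() if v >= threshold}


# Hypothetical search results, for illustration only.
EXAMPLE_HITS = [
    (1e-50, ["protein kinase activity"]),
    (1e-40, ["kinase activity"]),
    (1e-5, ["ATP binding"]),
]

if __name__ == "__main__":
    for term, value in sorted(score_terms(EXAMPLE_HITS, PARENTS).items()):
        print(f"{term}: {value:.3f}")
```

With these toy hits, the terms supported only by the weak third alignment fall below the 0.2 threshold and are not reported, while the kinase-related terms supported by the two strong alignments are retained.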
GOblet [1] is another tool that automates GO term inference from BLAST searches.
The newest version of GOblet includes a statistical analysis of the terms associated with
the proteins found. Some GO terms are more prevalent than others, and the distribution of such
prevalence may differ between species. The observation of a highly prevalent term in the
n homologues of a sequence may not be very significant if the probability of observing the
same term in a random sample of n sequences from that species is very high. Conversely,
the observation of a much less prevalent term is more significant. To quantify the enrichment
of a term in the homologues of a protein given prior knowledge of the prevalence of each
term, GOblet uses the Fisher exact test with Bonferroni correction to obtain a P-value. This
new version of GOblet also includes pathway annotations from MetaCyc [53]. GOblet can
be accessed via a web service at http://goblet.molgen.mpg.de.
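The enrichment test can be illustrated with a short sketch. It is not GOblet's code: it assumes a one-sided test for over-representation (the text does not state which variant GOblet uses), computes the P-value directly as the upper tail of the hypergeometric distribution, and uses hypothetical function names and counts.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact test P-value for enrichment of a term.

    k: homologues annotated with the term
    n: number of homologues
    K: background sequences annotated with the term
    N: background sequences in total
    Returns P(X >= k), where X is the number of annotated sequences in a
    random sample of n sequences drawn without replacement.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Hypothetical counts: 20 homologues, background of 10000 sequences.
    n, N = 20, 10000
    tested_terms = {
        "prevalent term": (12, 5000),  # (k, K): seen in 12 of 20 homologues
        "rare term": (5, 50),          # (k, K): seen in 5 of 20 homologues
    }
    for name, (k, K) in tested_terms.items():
        p = fisher_enrichment_p(k, n, K, N)
        print(name, bonferroni(p, len(tested_terms)))
```

In this example the rare term receives a far smaller P-value than the prevalent term even though it appears in fewer homologues, which is the behaviour the paragraph above describes.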
GOtcha [54] takes a similar approach to GoFigure. Given a protein sequence, a score
$R = \max(-\log_{10}(E), 0)$ is computed for each alignment, where $E$ is the E-value of the
alignment. Each GO term annotated to at least one protein in the BLAST results, as well as its
ancestor terms, is assigned a score equivalent to the sum of the scores R of each alignment
associated with the term. The score for each term is subsequently normalized by dividing
it by the score of the node that is the ancestor term of all scored terms, or the root term.
This normalized score is termed the internal score (I-score), and reflects the relative
significance of each term in the search results. A second score, termed the C-score, is computed
as the natural logarithm of the score of the root node, and reflects the confidence of the
search results as a whole. To obtain an intuitive and meaningful score for each prediction,
the accuracy of various combinations of discretized I-scores and C-scores is estimated for
each GO term using annotated sequences from SwissProt. Each prediction is then assigned a score
derived from the closest estimated accuracy for its I-score and C-score. GOtcha is able to