Prediction of Protein Function - Genomics: Essential Methods

to arise from evolutionary conservation, such as that between homologues. The simplest

form of protein function inference from the sequence of a protein is to search

for likely homologues using BLAST and examine the annotations of these homologous

proteins. Building on this principle, an array of tools has been developed to extend this

concept in different directions and degrees to improve functional inference using sequence

homology.

9.2.3.2 Automated function prediction from homologues

GoFigure [2] is one of the earliest such tools. Given the sequence of a protein,

GoFigure first performs a homology search using BLAST to find GO-annotated proteins

with similar sequences. The sub-graph of the GO DAG with the greatest depth from the root

that includes all GO terms assigned to these proteins is then identified. This graph is termed

the minimum covering graph (MCG). Each term in this MCG is then assigned a weighted

score derived from alignments to proteins annotated with the term. The contribution of

an alignment is inversely related to its E-value. The score for each term in the MCG

is then normalized by dividing it by that of the root. Terms with a normalized score of

0.2 or greater are then reported as inferred annotations for the query protein. GoFigure

provides a systematic way of assigning weighted GO terms to a query protein based on

homology search, giving higher weight to terms annotated to more proteins in the search

results, as well as terms associated with more significant alignments. GoFigure was initially

available at http://udgenome.ags.udel.edu/frm_go.html/, but is no longer available at the

time of writing.
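
As a rough illustration of the scoring scheme described above, the following Python sketch accumulates alignment weights over a covering graph, normalizes by the root, and applies the 0.2 cut-off. It is not GoFigure's actual code. The toy ontology in PARENTS, the function names, and the use of -log10(E) as the alignment weight are all assumptions made for illustration; the text states only that an alignment's contribution is inversely related to its E-value.

```python
import math

# Hypothetical toy fragment of the GO DAG: term -> list of parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "kinase": ["catalytic"],
    "atp_binding": ["binding"],
}


def with_ancestors(term):
    """Return the term together with all of its ancestors up to the root."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def infer_terms(hits, threshold=0.2):
    """hits: list of (e_value, [GO terms of the matched protein]) pairs.

    Returns {term: normalized score} for terms at or above the threshold.
    """
    scores = {}
    for e_value, terms in hits:
        # Assumed weighting: smaller E-values contribute more.
        # E is floored so that an E-value of 0 does not break log10.
        weight = max(-math.log10(max(e_value, 1e-300)), 0.0)
        covered = set()
        for term in terms:
            covered |= with_ancestors(term)
        # Each alignment counts once per term in the covering graph.
        for term in covered:
            scores[term] = scores.get(term, 0.0) + weight
    root_score = scores.get("root", 0.0)
    if root_score == 0.0:
        return {}
    return {t: s / root_score for t, s in scores.items()
            if s / root_score >= threshold}


example_hits = [
    (1e-50, ["kinase"]),
    (1e-10, ["kinase", "atp_binding"]),
    (1e-2, ["binding"]),
]
print(infer_terms(example_hits))
```

In this toy run, terms supported only by weak alignments fall below the cut-off, while terms shared by the strongest hits are retained along with their ancestors.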

GOblet [1] is another tool that automates GO term inference from BLAST searches.

The newest version of GOblet includes statistical analysis of the terms associated with

the proteins found. Some GO terms are more prevalent than others, and the distribution of such

prevalence may differ between species. The observation of a highly prevalent term in the

n homologues of a sequence may not be very significant if the probability of observing the

same term in a random sample of n sequences from that species is very high. Conversely,

the observation of a much less prevalent term is more significant. To quantify the enrichment

of a term in the homologues of a protein given prior knowledge of the prevalence of each

term, GOblet uses the Fisher exact test with Bonferroni correction to obtain a P-value. This

new version of GOblet also includes pathway annotations from MetaCyc [53]. GOblet can

be accessed via a web service at http://goblet.molgen.mpg.de.
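
The enrichment reasoning above can be made concrete with a small Python sketch of a one-sided Fisher exact test followed by a Bonferroni correction. This is only an illustration of the statistics named in the text, not GOblet's implementation. The function names, the choice of a one-sided test, and the counts in the example are assumptions.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact P-value for term enrichment.

    k: homologues carrying the term
    n: number of homologues found
    K: background sequences carrying the term
    N: total background sequences for the species
    Returns the probability of seeing at least k annotated sequences
    in a random sample of n sequences (hypergeometric upper tail).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bonferroni(p_value, n_tests):
    """Bonferroni correction: scale by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 20 homologues, 20000 background sequences,
# and 500 GO terms tested.
prevalent = fisher_enrichment_p(k=10, n=20, K=8000, N=20000)
rare = fisher_enrichment_p(k=5, n=20, K=40, N=20000)
print("prevalent term:", bonferroni(prevalent, 500))
print("rare term:", bonferroni(rare, 500))
```

With these made-up counts, ten hits to a term carried by 40% of the background are unremarkable, whereas five hits to a term carried by only 40 of 20000 sequences remain highly significant after correction.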

GOtcha [54] takes a similar approach to GoFigure. Given a protein sequence, a score

R = max(-log10(E), 0) is computed for each alignment, where E is the E-value of the

alignment. Each GO term annotated to at least one protein in the BLAST results, as well as its

ancestor terms, is assigned a score equal to the sum of the R scores of the alignments

associated with the term. The score for each term is subsequently normalized by dividing

it by the score of the node that is the common ancestor of all scored terms, or the root term.

This normalized score is termed the internal score (I-score), and reflects the relative

significance of each term in the search results. A second score, termed the C-score, is computed

as log e of the score of the root node, and reflects the confidence of the search results as a whole. To

obtain an intuitive and meaningful score for each prediction, an estimate of the accuracy

of various combinations of discretized I-scores and C-scores for each GO term is made

using annotated sequences from SwissProt. Each prediction is then assigned a score based

on the estimated accuracy of the combination closest to its I-score and C-score. GOtcha is able to
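
The R score, I-score and C-score described for GOtcha can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GOtcha's own code: the toy ANCESTORS table and the function names are hypothetical, and the calibration against SwissProt that turns I-score and C-score pairs into accuracy estimates is not reproduced.

```python
import math

# Hypothetical toy ontology: each term mapped to all of its ancestors.
ANCESTORS = {
    "root": set(),
    "catalytic": {"root"},
    "binding": {"root"},
    "kinase": {"catalytic", "root"},
}


def r_score(e_value):
    """R = max(-log10(E), 0); E is floored to avoid log10(0)."""
    return max(-math.log10(max(e_value, 1e-300)), 0.0)


def gotcha_scores(hits):
    """hits: list of (e_value, [GO terms of the matched protein]) pairs.

    Returns (i_scores, c_score): per-term scores normalized by the root,
    and the natural log of the root score.
    """
    totals = {}
    for e_value, terms in hits:
        covered = set()
        for term in terms:
            covered |= {term} | ANCESTORS[term]
        # Sum R over every alignment associated with the term or a descendant.
        for term in covered:
            totals[term] = totals.get(term, 0.0) + r_score(e_value)
    root_total = totals.get("root", 0.0)
    if root_total <= 0.0:
        return {}, None  # no informative alignments
    i_scores = {term: total / root_total for term, total in totals.items()}
    c_score = math.log(root_total)
    return i_scores, c_score


i_scores, c_score = gotcha_scores([
    (1e-20, ["kinase"]),
    (1e-5, ["binding"]),
    (10.0, ["binding"]),  # E >= 1 gives R = 0 and adds nothing
])
print(i_scores, c_score)
```

Note that alignments with E-values of 1 or more receive R = 0 and so do not influence either score, which is the effect of the max(..., 0) term in the formula.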
