Prediction of Protein Function - Genomics: Essential Methods


provide a more meaningful score than GoFigure, one that takes into account the estimated accuracy based on some characteristics of the search results. Since the accuracy is estimated separately for each GO term, the approach also accounts for differences in the background frequency of each term. GOtcha is available as a web service at http://www.compbio.dundee.ac.uk/gotcha/gotcha.php.

GOAnno [55] takes a different approach to the use of sequence homology for function inference. Given a query protein sequence, PipeAlign [56] is used to search for its homologues and construct a multiple alignment of complete sequences (MACS) that consists of clusters of homologues, each representing a potential functional subgroup. GO terms are then assigned based on three sets of annotations. The first set is the initial protein gene ontology (IPO), which is the set of already known annotations for the query. The second set, the proximal protein gene ontology (PPO), is the set of GO terms annotated to proteins that share at least 98% sequence identity with the query protein. The last set, the mean subfamily gene ontology (MSO), is the set of GO terms annotated to sequences in the subgroups detected by PipeAlign that satisfy the NorMD [57] multiple sequence alignment score threshold of NorMD > 0.3. Each term is scored by the number of homologous proteins that are annotated with the term or its descendant terms. Thresholds are also imposed to remove GO branches that are associated with too few proteins. The three sets of annotations are combined to give the final predicted GO terms. GOAnno is available as a web service at http://bips.u-strasbg.fr/GOAnno/GOAnno.html.
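
The term-scoring step described above (counting the homologues annotated with a term or any of its descendants, then discarding poorly supported branches) can be sketched as follows. This is a minimal illustration, not GOAnno's implementation: the toy ontology, the function names and the `min_support` cutoff are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {
    "GO:kinase": {"GO:catalytic"},
    "GO:catalytic": {"GO:molecular_function"},
    "GO:binding": {"GO:molecular_function"},
    "GO:molecular_function": set(),
}


def with_ancestors(term, parents):
    """Return the term together with all of its ancestor terms."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def score_terms(homologue_annotations, parents, min_support=2):
    """Count, per GO term, the homologues annotated with it or a descendant.

    homologue_annotations maps a homologue ID to its directly annotated terms.
    Terms supported by fewer than min_support homologues are dropped
    (min_support is an illustrative cutoff, not a value from the source).
    """
    counts = Counter()
    for terms in homologue_annotations.values():
        # Propagate every annotation up the ontology. Collecting into a set
        # means each homologue contributes at most once to any given term.
        propagated = set()
        for term in terms:
            propagated |= with_ancestors(term, parents)
        counts.update(propagated)
    return {term: n for term, n in counts.items() if n >= min_support}


homologues = {
    "P1": {"GO:kinase"},
    "P2": {"GO:kinase"},
    "P3": {"GO:binding"},
}
scores = score_terms(homologues, PARENTS, min_support=2)
# "GO:binding" is supported by a single homologue and is filtered out.
```

Propagating annotations upwards before counting is one simple way to credit a term for annotations made to its descendants; the real tool may organise this differently.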

GOPET [3] takes a machine-learning approach towards function prediction from sequence homology. A large number of sequences are searched against a database of GO-annotated sequences. For each query, the GO terms annotated to each homologue found are used as training examples; a term is deemed a positive example if it is annotated to the query protein and a negative example otherwise. Each term is assigned a number of features, such as the E-value, bit score and sequence identity of the alignment, as well as the background frequency of the term, the evidence codes used for the annotation of these terms, and so on. The training examples are then split randomly into smaller sets that are used to build multiple classifiers using support vector machines (SVMs). To predict functions for a given query protein sequence, homologous proteins are obtained using BLAST, and each GO term annotated to these proteins is scored by building similar features for it and using the classifiers to classify it as positive or negative. The votes from the classifiers are summed to obtain the final score. The authors of GOPET compared the method against GOtcha and found that the two performed comparably. GOPET is available as a web service at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar.
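
The feature construction and vote summing described above might look roughly like the sketch below. It is not GOPET's code: the `Hit` fields, the four features chosen and the function names are illustrative, and the classifiers are plain callables standing in for trained SVMs. Summing votes over every hit that carries a term is also an assumption, since the text does not say how repeated terms are aggregated.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit against the annotated database (illustrative fields)."""
    e_value: float
    bit_score: float
    identity: float       # fraction of identical positions, 0..1
    go_terms: frozenset   # GO terms annotated to the homologue


def build_examples(hits, background_freq, query_terms=None):
    """Return one (term, features, label) tuple per GO term on each hit.

    During training, query_terms holds the known annotations of the query
    and label is True (positive) or False (negative). At prediction time
    query_terms is None and so is the label.
    """
    examples = []
    for hit in hits:
        for term in sorted(hit.go_terms):
            features = (
                hit.e_value,
                hit.bit_score,
                hit.identity,
                background_freq.get(term, 0.0),
            )
            label = None if query_terms is None else term in query_terms
            examples.append((term, features, label))
    return examples


def vote(examples, classifiers):
    """Sum the positive votes for each term over all classifiers.

    Assumption: a term found on several hits accumulates votes from each.
    """
    scores = {}
    for term, features, _label in examples:
        votes = sum(1 for classify in classifiers if classify(features))
        scores[term] = scores.get(term, 0) + votes
    return scores


hits = [
    Hit(1e-50, 200.0, 0.9, frozenset({"GO:a", "GO:b"})),
    Hit(1e-3, 40.0, 0.3, frozenset({"GO:b"})),
]
background = {"GO:a": 0.01, "GO:b": 0.2}

# Stand-ins for trained SVMs: each returns True for a "positive" call.
classifiers = [
    lambda f: f[0] < 1e-10,   # strict E-value
    lambda f: f[2] > 0.5,     # high sequence identity
    lambda f: f[1] > 30.0,    # moderate bit score
]

scores = vote(build_examples(hits, background), classifiers)
```

In a fuller version, each stand-in would be replaced by a classifier trained on one random subset of the labelled examples produced by `build_examples`.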

9.2.3.3 Remote homology

PFP [6] (http://dragon.bio.purdue.edu/pfp/) improves upon existing sequence-based approaches by extending the sequence homology search beyond highly similar sequences. Instead of BLAST, it uses Position-Specific Iterated BLAST (PSI-BLAST) [58]. PSI-BLAST performs an initial BLAST search using the query sequence and then builds a multiple sequence alignment of the close homologues discovered, using the query sequence as a template. This alignment is used to create a profile that takes into account the amino acid variation at each position. The profile, which serves as a model of the homologues found in the BLAST search, is then used to search against sequences in the database with a slightly modified BLAST algorithm.
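
To make the idea of a position-specific profile concrete, the sketch below builds per-column log-odds scores from a small alignment and slides the result along a target sequence. It is a deliberately simplified illustration and not PSI-BLAST's algorithm: it assumes a uniform background frequency, a fixed pseudocount, no sequence weighting, ungapped matching and standard residues only, and all function names are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from aligned sequences.

    alignment: equal-length strings; '-' marks a gap and is ignored.
    Assumes a uniform background of 1/20 per residue (a simplification).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(
            seq[col] for seq in alignment if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, segment):
    """Sum the column scores for an ungapped segment of profile length."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, offset).

    Assumes the sequence is at least as long as the profile.
    """
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


alignment = ["ACD", "ACE", "ACD"]
profile = build_profile(alignment)
score, offset = best_hit(profile, "WWACDWW")
# The best-scoring window starts where "ACD" occurs in the target.
```

Residues that are conserved in a column receive positive scores and unseen residues receive negative ones, which is what lets a profile detect more distant homologues than a single query sequence would.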
