Prediction of Protein Function - Genomics: Essential Methods

Synergizer [50] maintains a database for translating between the different identifiers of biological entities. The translation service can be accessed interactively via the web page at http://llama.med.harvard.edu/synergizer/translate/, or programmatically through a remote procedure call to a web service over the Hypertext Transfer Protocol (HTTP). The web service returns an object encoded in JavaScript Object Notation (JSON), which is easily decoded for further processing. More details on the JSON format can be found at http://www.json.org/. Details on how to access the Synergizer web service, with examples in Perl, are available at http://llama.med.harvard.edu/synergizer/doc/. At the time of writing, Synergizer covers genes from 50 genomes.
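The programmatic route described above can be sketched in Python. This is a minimal illustration and not the documented client: the method name "translate", the parameter field names (authority, species, domain, range, ids), the endpoint path and the shape of the reply are all assumptions that should be checked against the Synergizer documentation before use. The identifiers in the usage note are made-up placeholders.

```python
import json
import urllib.request

# Assumed endpoint, inferred from the interactive URL given in the text.
SYNERGIZER_URL = "http://llama.med.harvard.edu/synergizer/translate"


def build_request(ids, authority, species, domain, range_, request_id=0):
    """Serialize a JSON-RPC style request body (all field names are assumptions)."""
    payload = {
        "method": "translate",
        "params": [{
            "authority": authority,  # source of the identifier mappings
            "species": species,
            "domain": domain,        # namespace of the input identifiers
            "range": range_,         # namespace to translate into
            "ids": list(ids),
        }],
        "id": request_id,
    }
    return json.dumps(payload)


def parse_response(body):
    """Decode a reply, assuming each result row is [query_id, translation, ...]."""
    reply = json.loads(body)
    if reply.get("error"):
        raise RuntimeError(f"service error: {reply['error']}")
    return {row[0]: row[1:] for row in reply["result"]}


def translate(ids, authority, species, domain, range_, timeout=30):
    """Send the request over HTTP POST and return the decoded mapping."""
    request = urllib.request.Request(
        SYNERGIZER_URL,
        data=build_request(ids, authority, species, domain, range_).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_response(response.read().decode("utf-8"))
```

A call might look like translate(["geneA", "geneB"], "ensembl", "Homo sapiens", "hgnc_symbol", "entrezgene"), where the namespace names are again assumed. Keeping the request building and response parsing separate from the network call lets both be checked without contacting the service.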

9.2.3 Sequence homology

Sequence homology formed the basis of gene and protein function inference in early approaches, and it remains useful and widely used. Using the peptide sequence of an unknown protein to infer its function is not only intuitive (since amino acids are its basic building blocks), but also often necessary, because the sequence is usually the only biological information available for a novel protein, as in the case of a newly sequenced genome.

9.2.3.1 Homologue discovery

The most popular way to get a quick suggestion of the possible characteristics of a protein from its sequence is to search for annotated proteins with very high sequence similarity to it. Proteins with very similar sequences are likely to be homologous, meaning that they descend from the same ancestral gene and have been conserved during evolution. Since proteins are vital players in the biological functions necessary for the survival of an organism, their sequences are conserved by selective pressure during speciation, so that the orthologous proteins in each species retain their ability to function effectively. Paralogous proteins, which are homologues within the same species arising from gene duplication events, also tend to retain similar sequences and functions, although they are more likely to diverge in both, since only one paralogous gene needs to be conserved to uphold the role of the original gene. However, proteins with high sequence similarity are not necessarily conserved homologues; the similarity could have arisen by chance during evolution. Chance similarity is likely when the proteins have very short sequences, and becomes less probable as sequence length increases. Hence, sequence length must be taken into account when searching for homologues.
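The effect of sequence length on chance similarity can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a deliberately simplified model that is not taken from the text: residues are independent and uniformly distributed over 20 amino acids, and only exact matches count. The function name and the database size of 10**8 residues are hypothetical choices for illustration; real search tools rely on alignment scores and more careful statistics.

```python
def expected_chance_matches(peptide_length, database_residues, alphabet_size=20):
    """Expected number of exact chance occurrences of one fixed peptide.

    Simplified model: every database position is an independent, uniform
    draw from `alphabet_size` residues.
    """
    # Number of start positions at which the peptide could occur.
    positions = max(database_residues - peptide_length + 1, 0)
    # Probability that one position matches the peptide at every residue.
    match_probability = (1.0 / alphabet_size) ** peptide_length
    return positions * match_probability


# Illustrative database of 10**8 residues (an arbitrary, assumed size).
for length in (3, 5, 10, 20):
    print(length, expected_chance_matches(length, 10**8))
```

Under this model the expected count falls by a factor of 20 for every additional residue, so a short peptide is expected to recur many times by chance in a large database, while a long one is not.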

The Basic Local Alignment Search Tool (BLAST) [51] does this very well and very quickly, and has become the standard tool for this purpose among experimental and computational biologists. Using a heuristic that combines exact matching with extension, the tool performs very fast local sequence alignment between a query sequence and a large database of sequences, allowing for inexact matching that includes insertions, deletions and mismatches. BLAST also offers highly configurable parameters, such as gap initiation and extension penalties and the choice of substitution matrix [52]. It reports several scoring metrics, including sequence identity (the percentage of aligned positions that match exactly), the alignment score and a very useful and widely used statistical score known as the expected value (E-value). The E-value is the expected number of sequences in the database that would obtain an alignment score at least as high with the query sequence by chance. A low E-value indicates that an alignment is more likely
