Biomedical Engineering Reference
In-Depth Information
Synergizer [50] maintains a database to translate between the different identifiers of bio-
logical entities. The translation service can be accessed interactively via the web page at
http://llama.med.harvard.edu/synergizer/translate/, or programmatically using a remote procedure
call to a web service over the Hypertext Transfer Protocol (HTTP). The web
service returns a JSON-encoded object (JavaScript Object Notation), which can be eas-
ily decoded for further processing. More details on the JSON format can be found at
http://www.json.org/. Details on how to access the Synergizer web service are available
at http://llama.med.harvard.edu/synergizer/doc/, with examples using Perl. At the time of
writing, Synergizer covers genes from 50 genomes.
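As a rough illustration of the programmatic route, the Python sketch below builds a JSON-encoded remote procedure call and decodes a JSON reply. The method name translate, the parameter field names (authority, species, domain, range, ids), the helper function names and the shape of the reply are all assumptions made for illustration and should be checked against the Synergizer documentation. No endpoint is assumed: the URL is passed in by the caller, and the sample reply uses placeholder values only to show the decoding step.

```python
import json
import urllib.request


def build_request(ids, authority, species, domain, range_, request_id=0):
    """Encode a remote procedure call as a JSON string.

    The method name and parameter field names are assumptions for
    illustration; check them against the service documentation.
    """
    payload = {
        "method": "translate",
        "params": [{
            "authority": authority,
            "species": species,
            "domain": domain,
            "range": range_,
            "ids": list(ids),
        }],
        "id": request_id,
    }
    return json.dumps(payload)


def decode_reply(text):
    """Decode a JSON reply and return its 'result' field, raising on error."""
    reply = json.loads(text)
    if reply.get("error") is not None:
        raise RuntimeError(f"service reported an error: {reply['error']}")
    return reply["result"]


def call_service(url, request_body, timeout=30):
    """POST the JSON request over HTTP and return the decoded result.

    `url` is supplied by the caller; no endpoint is assumed here.
    """
    req = urllib.request.Request(
        url,
        data=request_body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decode_reply(resp.read().decode("utf-8"))


# Offline demonstration with placeholder values (not real identifiers).
body = build_request(["ID_A", "ID_B"], "some_authority", "some_species",
                     "source_namespace", "target_namespace")
sample_reply = '{"result": [["ID_A", "X1"], ["ID_B", null]], "error": null, "id": 0}'
translations = decode_reply(sample_reply)
```

Keeping the encoding and decoding steps separate from the HTTP call makes them easy to check without network access.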
9.2.3 Sequence homology
Sequence homology formed the basis of gene/protein function inference in early approaches,
and remains very useful and widely used. Using the peptide sequence of an unknown protein
to infer its function is not only intuitive (since amino acids form its basic building blocks),
but also often necessary, as it is usually the only biological information available for a novel
protein, as in the case of a newly sequenced genome.
9.2.3.1 Homologue discovery
The most popular way to get a quick suggestion on the possible characteristics of a protein
given its sequence is to search for annotated proteins that have very high levels of sequence
similarity to it. Proteins with very similar sequences are likely to be homologous, which
means that they originated from the same gene in an ancestor and are conserved during
evolution. Since proteins are vital players in the performance of various biological func-
tions necessary for the survival of an organism, their sequence is conserved by selective
pressure during speciation so that the orthologous proteins in each species retain their ability
to function effectively. Paralogous proteins, which are homologues within the same species
arising from gene duplication events, also tend to retain similar sequences and functions,
although they are more likely to diverge in both, since only one paralogous gene needs to
be conserved to uphold the role of the original gene. However, proteins with high sequence
similarity are not necessarily conserved homologues; the similarity could have arisen by chance
during evolution. This is more likely when the proteins have very short sequences, and becomes
less probable with longer sequences. Hence, sequence length must be taken into account when
searching for homologues.
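The dependence on sequence length can be illustrated with a deliberately simplified model, which is an assumption of this sketch and not the statistic used by any particular alignment tool: two ungapped sequences of equal length whose residues are drawn independently and uniformly from the 20 amino acids, so that each position matches with probability 1/20. The probability of reaching a given number of matches by chance is then a binomial tail. The function name and the 40% identity threshold are hypothetical choices for the example.

```python
from math import comb


def chance_match_probability(length, min_matches, p_match=1 / 20):
    """Probability that two random ungapped sequences of the given length
    agree at `min_matches` or more positions.

    Assumes independent, uniformly distributed residues (a simplification).
    """
    return sum(
        comb(length, k) * p_match ** k * (1 - p_match) ** (length - k)
        for k in range(min_matches, length + 1)
    )


# Probability of at least 40% identity by chance, for increasing lengths.
for n in (5, 10, 20, 50):
    needed = (2 * n + 4) // 5  # smallest integer >= 0.4 * n
    print(n, needed, chance_match_probability(n, needed))
```

Under these assumptions the chance probability of a fixed level of identity falls steeply as the sequences get longer, which is why short matches warrant more caution.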
The Basic Local Alignment Search Tool (BLAST) [51] performs this search very well and very
quickly, and has become the standard tool for this purpose among experimental and
computational biologists. Using a heuristic that combines exact matching with extension,
the tool is able to perform very fast local sequence alignment between a query sequence
and a large database of sequences that allows for inexact matching including insertions,
deletions and mismatches. BLAST also comes with highly configurable parameters, such
as gap initiation and extension penalties, and the choice of substitution matrix [52]. The
tool also reports several scoring metrics, including sequence identity (the percentage of
aligned positions that match exactly), the alignment score, and a very useful and widely used
statistical score known as the expected value (E-value). The E-value is the expected number
of sequences in the database that would obtain an alignment score at least as high with
the query sequence by chance. A low E-value indicates that an alignment is more likely