SVMs are widely used in microarray analysis, exploring issues such as the nature of host-microbe interactions (Cummings and Relman, 2000), the prediction of mitochondrial proteins (Kumar et al., 2006), prokaryotic gene finding (Krause et al., 2007), protein functional classification (Cai et al., 2003), protein subcellular localisation (Bhasin et al., 2005; Gardy et al., 2005; Gardy and Brinkman, 2006) and even the tracking of the source of microbes in heavily polluted water (Belanche-Muñoz and Blanch, 2008).
Software Availability
The following software is available.
Gismo (Gene Identification Using a Support Vector Machine for ORF Classification): http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/. Source code in Perl; requires local installation of Perl plus a number of Perl modules.
SVMProt: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Protein functional family prediction.
5.2
Hidden Markov models
Hidden Markov models (HMMs) were first introduced in the 1960s (Baum and Petrie, 1966), and have been applied to the analysis of time-dependent data in fields such as cryptanalysis, speech recognition and speech synthesis. Their applicability to problems in bioinformatics became apparent in the late 1990s (Krogh, 1998).
HMMs are frequently used for the statistical analysis of multiple DNA sequence alignments. They can be used to identify genomic features such as ORFs, insertions, deletions, substitutions and protein domains, amongst many others. HMMs can also be used to identify homologies; the widely used Pfam database (Punta et al., 2012), for example, is a database of protein families identified using HMMs. HMMs can be significantly more accurate than the workhorse of sequence comparison tools, BLAST (Basic Local Alignment Search Tool), first produced in 1990 (Altschul et al., 1990, 1997).
An HMM is a statistical model of a sequence. It consists of a library of symbols making up the sequence, and a set of states that an element of the sequence might occupy. Each state has a set of weighted transition probabilities: the probability of moving to a different state. A transition probability depends solely upon the previous state; states prior to the previous state have no effect on transition probabilities. An HMM also has a set of emission probabilities: the probability of producing a particular element of the sequence (Figure 2.7). A model is trained using known sequences to optimise the weights, and can then be applied to unknown sequences in order to make predictions. Since several paths through an HMM may produce the same sequence, paths are ranked by likelihood, by multiplying all of the probabilities together and taking the logarithm of the result. An algorithm known as the Viterbi algorithm (Forney, 1973) provides an optimal state sequence for many purposes.