SVMs are widely used in microarray analysis, exploring issues such as the nature of host-microbe interactions (Cummings and Relman, 2000), the prediction of mitochondrial proteins (Kumar et al., 2006), prokaryotic gene finding (Krause et al., 2007), protein functional classification (Cai et al., 2003), protein subcellular localisation (Bhasin et al., 2005; Gardy et al., 2005; Gardy and Brinkman, 2006) and even the tracking of the source of microbes in heavily polluted water (Belanche-Muñoz and Blanch, 2008).
Software Availability
The following software is available.
Gismo (Gene Identification Using a Support Vector Machine for ORF Classification): http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/. Source code in Perl; requires local installation of Perl plus a number of Perl modules.
SVMProt: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi. Protein functional family prediction.
5.2
Hidden Markov models
Hidden Markov models (HMMs) were first introduced in the 1960s (Baum and Petrie, 1966), and have been applied to the analysis of time-dependent data in fields such as cryptanalysis, speech recognition and speech synthesis. Their applicability to problems in bioinformatics became apparent in the late 1990s (Krogh, 1998).
HMMs are frequently used for the statistical analysis of multiple DNA sequence alignments. They can be used to identify genomic features such as ORFs, insertions, deletions, substitutions and protein domains, amongst many others. HMMs can also be used to identify homologies; the widely used Pfam database (Punta et al., 2012), for example, is a database of protein families identified using HMMs. HMMs can be significantly more accurate than the workhorse of sequence comparison tools, BLAST (Basic Local Alignment Search Tool), first produced in 1990 (Altschul et al., 1990, 1997).
An HMM is a statistical model of a sequence. It consists of a library of symbols making up the sequence, and a set of states that an element of the sequence might occupy. Each state has a set of weighted transition probabilities: the probability of moving to a different state. A transition probability depends solely upon the previous state; states prior to the previous state have no effect on transition probabilities. An HMM also has a set of emission probabilities: the probability of producing a particular element of the sequence (Figure 2.7). A model is trained using known sequences to optimise the weights, and can then be applied to unknown sequences in order to make predictions. Since several paths through an HMM may produce the same sequence, paths are ranked by likelihood, by multiplying all of the probabilities together and taking the logarithm of the result. An algorithm known as the Viterbi algorithm (Forney, 1973) provides an optimal state sequence for many purposes.