Identifying CpG Islands: Sliding Window and Hidden Markov Model Approaches - Mathematical Concepts and Methods in Modern Biology

Biology Reference

In-Depth Information

It can be shown that the computational complexity of the Viterbi algorithm is no

worse than O

n 2 l

in time 8 and O

is the number of states

and l is the length of the sequence. Thus, the complexity of the algorithm is quadratic

in time, which, although still computationally demanding, is a huge improvement

over the exponential rate of the brute force approach.

Since the Viterbi algorithm involves multiplying many probabilities, the results

could get very small, resulting in essentially zero probabilities and causing under-

flow problems. To avoid this, all calculations are performed on the logarithms of the

probabilities

(

)

(

)

in space, where n

, using addition instead of multiplication.

Exercise 9.6. Show that the hidden path found by the Viterbi algorithm is exactly

the path of maximal probability, given the observed sequence x . That is, show that

v k (

∈

,...

π max =

argmax

(

,π) =

argmax

(π |

Obviously, Example 9.5 shows that performing the Viterbi computations by hand

is tedious even for very short paths. From now on, we will use the CpG Educate suite

that implements the Viterbi algorithm for HMMs corresponding to the Dishonest

Casino problem from Example 9.4 and for the CGI identification problem.

The applications in the CpGEducate suite have the capability to generate an output

from a HMM and record the hidden path that produced the emitted sequence. This

simulated data can then be used to test the “accuracy” of the Viterbi algorithm by

comparing the predicted states from the Viterbi decoding with the actual ones from

the simulations.

Example 9.6. We used the Dishonest Casino application in the CpGEducate suite to

generate a sequence of length 400 for the Dishonest Casino HMM from Example 9.4 .

We then applied the Viterbi algorithm to the emitted sequence and compared the

output with the known hidden sequence from the simulations. Figure 9.7 depicts the

outcome. The gray background identifies rolls made with the unfair die. The outcomes

in bold are the predicted unfair die state from the path of maximal probability found

by the Viterbi algorithm. Although the overlap is not perfect, Viterbi appears to have

recovered the states fairly well. In practice, the Viterbi algorithmwill often miss short

sequences generated by either die. Since switching between the dice happens with

small probabilities, the path of maximal probability would generally tend to “bridge”

short gaps and preserve the trends set by longer runs.

When using a HMM for identifying CGIs, the outcome of the Viterbi algorithm

will produce the predicted path through the hidden “+” and “

” states. The start

of the island will be where the predicted hidden sequence switches from the subset

−

− ={

− ,

− }

to the subset Q

+ ={

+ ,

+ }

and the end of the

island will be where the sequence switches back from the “

”tothe“

−

” states.

8 The “big O ” notation here means that for very large values of n and l the number of operations

required by the Viterbi algorithm has the same order of magnitude as the number n 2 l .

Mathematical Concepts and Methods in Modern Biology

Search WWH ::

Custom Search

Home