Identifying CpG Islands: Sliding Window and Hidden Markov Model Approaches - Mathematical Concepts and Methods in Modern Biology

Biology Reference

In-Depth Information

With this background, we can now ask the following questions: Given an observed

sequence x

π =

π 1 π 2 π 3 ··· π l that maximizes the probability of observing x ? Can we estimate the

parameters of the HMM (the transition matrix P and the emission probabilities) from

the observed sequence x?

The main reason for considering these gambling examples is that the problem of

locating CGIs is in many ways mathematically analogous and may be stated similarly.

A DNA sequence x composed of the symbols A, C, T , and G may be viewed as gen-

erated by a mechanism switching between two hidden states Q

x 1 x 2 x 3 ···

x l , how do we determine the hidden sequence

, analogous

to the honest and dishonest dice. One state is that of CGIs (the “+” region), the other

is the non-island region (the “

={+ , −}

” region). The states A, C, T , and G are the same

for both regions, but the transition matrices and/or the emission probabilities will be

different. 5 We cannot observe the state-to-state transitions

−

π = π 1 π 2 π 3 ··· π l of the

process directly but we observe the sequences x

x 1 x 2 x 3 ···

x l emitted by the process

where each x t is a nucleotide from the set M

. The questions above for

the casino games are now immediately translated into the following questions: Given a

long DNA sequence x

}

TCAC , how do we determine which hidden

path is the most likely to have emitted this sequence? Can we estimate the transition

and the emission probabilities of the HMM from the observed sequence? To examine

these questions in detail, we first need to introduce the general notation for HMMs.

In general, a HMM consists of a hidden Markov chain with a finite state space

Q and a finite set of emission symbols M . The state of the hidden process at time is

denoted by

ACTGTC

···

π t . The transition probabilities are a ij =

(π t + 1

| π t

)

, and the

, i ∈ Q p i

initial distribution is p i

a 0 i

(π 1 =

∈

1. The process emits

a symbol k

∈

M from each of the hidden states j

∈

Q that it visits. The emission

probabilities are denoted by e j (

) :

e j (

) =

(

x t

| π t

∈

M ,

where (since each of the hidden states emits exactly one symbol) k ∈ M e j (

The set of transition, emission, and initial probabilities forms the set of parameters

of the HMM.

A convenient way to summarize the parameters of a HMM is to organize them in

a table where the transition matrix, emission probabilities, and the initial distribution

are given in the columns and where the rows are labeled by the hidden states. Table 9.3

contains the parameters of the HMM from Example 9.3 .

Exercise 9.4. Give the table for the parameters of the HMM from Example 9.4 .

) =

9.3.3 Viewing DNA Sequences As Outputs of a HMM

Tables 9.1 and 9.2 demonstrate two different ways of viewing the quantitative differ-

ences between the “

” and the “

−

” regions in the genome, each one of which can

5 For example, within the “

” regions, the transition probabilities could be similar to those

in Table 9.1 (with the modification that there should also be small nonzero probabilities of switching

between the “

”andthe“

−

”andthe“

−

” regions).

Mathematical Concepts and Methods in Modern Biology

Search WWH ::

Custom Search

Home