Identifying CpG Islands: Sliding Window and Hidden Markov Model Approaches - Mathematical Concepts and Methods in Modern Biology

For our main problem of CpG identification, the probability $P(x)$ of a DNA sequence of nucleotides $x = x_1 x_2 \cdots x_l$ is not of much interest by itself, but we will need it to compute the posterior probabilities $P(\pi_t = k \mid x)$, $k \in Q$, $t = 1, \ldots, l$, that symbol $x_t$ in the observed sequence was emitted from state $k$, which can then be used for decoding.

Since

$$
P(\pi_t = k \mid x) = \frac{P(x, \pi_t = k)}{P(x)}, \tag{9.3}
$$

we need a recursive algorithm for computing $P(x, \pi_t = k)$. The Markov property of the hidden process makes this possible:

$$
\begin{aligned}
P(x, \pi_t = k) &= P(x_1 x_2 \cdots x_t, \pi_t = k)\, P(x_{t+1} x_{t+2} \cdots x_l \mid x_1 x_2 \cdots x_t, \pi_t = k) \\
&= P(x_1 x_2 \cdots x_t, \pi_t = k)\, P(x_{t+1} x_{t+2} \cdots x_l \mid \pi_t = k). 
\end{aligned} \tag{9.4}
$$

The first line is a direct application of the conditional probability formula, where the probability that $x$ is generated with $\pi_t = k$ at time $t$ is given as the product of the probabilities of the following events: (1) symbols $x_1 x_2 \cdots x_t$ are emitted up to time $t$ and the process is in state $k$ at time $t$, and (2) conditioned upon the event (1), the rest of the emitted sequence is $x_{t+1} x_{t+2} \cdots x_l$. The second line follows from the Markov property of the hidden process and restates that the probability of emitting the sequence $x_{t+1} x_{t+2} \cdots x_l$ depends only on the state of the process at time $t$.

Notice that the $P(x_1 x_2 \cdots x_t, \pi_t = k)$ are exactly the probabilities $f_k(t)$, which are computed from the forward algorithm. Denote $b_k(t) = P(x_{t+1} x_{t+2} \cdots x_l \mid \pi_t = k)$.

Equation (9.4) can now be rewritten as

$$
P(x, \pi_t = k) = f_k(t)\, b_k(t). \tag{9.5}
$$
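Combining (9.3) and (9.5) expresses the posterior entirely in terms of the forward and backward quantities. The short worked step below is a sketch of that combination; the only ingredient not written out in the text is the summation of (9.5) over the states, which relies on the process being in exactly one state of $Q$ at time $t$.

```latex
P(x) = \sum_{j \in Q} P(x, \pi_t = j) = \sum_{j \in Q} f_j(t)\, b_j(t)
\quad \text{for every } t = 1, \ldots, l,
\qquad
P(\pi_t = k \mid x) = \frac{f_k(t)\, b_k(t)}{\sum_{j \in Q} f_j(t)\, b_j(t)}.
```

In particular, the sum in the denominator does not depend on $t$, which gives a convenient consistency check on computed values of $f_k(t)$ and $b_k(t)$.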

The probabilities $b_k(t)$ are computed by the backward algorithm. We begin by initializing the algorithm for $t = l$ where, since $x_l$ is the last observed symbol, $\pi_{l+1} = E$ is the end state (that does not emit a symbol). Thus $b_k(l) = P(\pi_{l+1} = E \mid \pi_l = k) = 1$ for all $k \in Q$, since the sequence goes to the end state $E$ with probability 1.

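Once we know $b_k(t+1)$ for all $k \in Q$, for any of the values $t = l-1, \ldots, 1$ we can compute

$$
\begin{aligned}
b_j(t) &= P(x_{t+1} x_{t+2} \cdots x_l \mid \pi_t = j) \\
&= \sum_{k \in Q} P(\pi_{t+1} = k \mid \pi_t = j)\, e_k(x_{t+1})\, P(x_{t+2} \cdots x_l \mid \pi_{t+1} = k) \\
&= \sum_{k \in Q} a_{jk}\, e_k(x_{t+1})\, b_k(t+1).
\end{aligned}
$$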

The justification is as follows: at time $t+1$ the process can transition from $j$ to any state $k \in Q$ (this happens with probability $a_{jk}$), emit $x_{t+1}$ (this happens with probability $e_k(x_{t+1})$), and, being in state $k$ at time $t+1$, emit the rest of the sequence $x_{t+2} \cdots x_l$ (which happens with probability $b_k(t+1)$). Summing these products over all $k \in Q$ gives $b_j(t)$.
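The backward recursion described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the two-state model ("+" for island, "-" for non-island), all numerical parameters, the `start` distribution, and the function names `forward`, `backward`, and `posterior` are assumptions made up for the example. The forward pass is included only so that the product $f_k(t)\, b_k(t)$ can be checked; positions are 0-based in the code, so index `t` corresponds to time $t+1$ in the text.

```python
# Toy two-state HMM; every number below is invented for illustration only.
STATES = ("+", "-")
START = {"+": 0.5, "-": 0.5}                      # assumed initial distribution
TRANS = {"+": {"+": 0.9, "-": 0.1},               # TRANS[j][k] plays the role of a_jk
         "-": {"+": 0.2, "-": 0.8}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # EMIT[k][s] plays the role of e_k(s)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def forward(x, states, start, trans, emit):
    """Return f with f[t][k] = P(x[0..t], state k at position t)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for t in range(1, len(x)):
        f.append({
            k: emit[k][x[t]] * sum(f[t - 1][j] * trans[j][k] for j in states)
            for k in states
        })
    return f


def backward(x, states, trans, emit):
    """Return b with b[t][j] = P(x[t+1..] | state j at position t)."""
    l = len(x)
    b = [None] * l
    # Initialization: the end state is reached with probability 1.
    b[l - 1] = {k: 1.0 for k in states}
    # Recursion: b_j(t) = sum_k a_jk * e_k(x_{t+1}) * b_k(t+1), moving right to left.
    for t in range(l - 2, -1, -1):
        b[t] = {
            j: sum(trans[j][k] * emit[k][x[t + 1]] * b[t + 1][k] for k in states)
            for j in states
        }
    return b


def posterior(f, b):
    """Return P(state k at position t | x) = f_k(t) * b_k(t) / P(x)."""
    px = sum(f[-1].values())  # P(x), using b_k(l) = 1 at the last position
    return [{k: f[t][k] * b[t][k] / px for k in f[t]} for t in range(len(f))]


if __name__ == "__main__":
    seq = "ACGCGT"
    f = forward(seq, STATES, START, TRANS, EMIT)
    b = backward(seq, STATES, TRANS, EMIT)
    for symbol, row in zip(seq, posterior(f, b)):
        print(symbol, {k: round(p, 4) for k, p in row.items()})
```

For long sequences the products of probabilities underflow, so a practical version would work with scaled values or logarithms; that refinement is left out here to keep the recursion visible.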
