Table 9.2 Sample nucleotide frequencies from [23]. The entries are computed
by averaging the columns of Table 9.1 for the ("+") and for the ("−") regions.

                                  A      C      T      G
Island ("+") Frequencies:        0.15   0.33   0.16   0.36
Non-Island ("−") Frequencies:    0.27   0.24   0.26   0.23
to have a way of obtaining some estimates for those from the data. This may initially
seem to be an impossible task. After all, how can we estimate those frequencies when
the boundaries between the regions are unknown? As we will see below, there is a way
to solve this problem if we view the DNA data as generated by a known underlying
mechanism, formally described by an HMM.
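As a simple illustration (not taken from the text) of how frequency tables like Table 9.2 can be used, the following sketch scores a short DNA string under two independent-nucleotide models: a positive log-odds score means the island ("+") frequencies explain the string better than the non-island ("−") frequencies. The function name `log_odds` is our own; the numbers are those of Table 9.2.

```python
# Illustrative sketch: log-odds scoring of a DNA string using the
# island ("+") and non-island ("-") frequencies of Table 9.2.
from math import log2

island = {"A": 0.15, "C": 0.33, "T": 0.16, "G": 0.36}      # "+" frequencies
non_island = {"A": 0.27, "C": 0.24, "T": 0.26, "G": 0.23}  # "-" frequencies

def log_odds(seq):
    """Sum over the string of log2( p+(x) / p-(x) ) for each nucleotide x."""
    return sum(log2(island[x] / non_island[x]) for x in seq)

print(log_odds("CGCG"))  # positive: C and G are more frequent in islands
print(log_odds("ATAT"))  # negative: A and T are more frequent outside islands
```

Of course, such a score ignores where one region ends and the next begins; handling the unknown boundaries is precisely what the HMM machinery below is for.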
The rest of the chapter is organized as follows: Section 3 contains a brief review of
Markov chains and HMMs, presenting some examples, highlighting some main concepts,
and introducing the notation. Readers without prior experience with Markov
chains or HMMs may need to consult a more detailed introduction to these topics
such as [25, 26]. In Section 4 we present CGI identification methods based on
HMMs. The section focuses on the questions of evaluation, decoding, and parameter
estimation for HMMs. Some final comments and notes on generalizations and ongoing
work involving HMMs for CGI identification are gathered in Section 5. A suite of
web applications, CpG Educate, has been developed by the authors to illustrate the
algorithms and is used for many of the examples and exercises in the chapter. CpG
Educate is freely available at http://inspired.jsu.edu/agarrett/cpg/.
9.3 DEFINITION AND BASIC PROPERTIES OF MARKOV
CHAINS AND HIDDEN MARKOV MODELS
9.3.1 Finite State Markov Chains
A Markov chain is a “process” that at any particular time is in one of a finite number
of “states” from a set Q. We will only consider discrete time, which
means that the process can be visualized as running according to a digital timer,
possibly changing states at each time step. The Markov property assumes the state of
the process at the next step depends only on its present state and not on the history
of how the process arrived at that state. Thus, the process has no memory beyond
the “present” and the only factors that govern the process are (i) its current state, and
(ii) the probabilities with which the process moves from its current state to other states.
Assume that Q = {s_1, ..., s_n}.
As time goes on, the process transitions from one state to another generating a path
π = π_1 π_2 ··· π_l (the number l is the length of the path), where π_t ∈ Q denotes
the state of the process at time t, t = 1, 2, 3, .... If we have observed the path
of the process up to time t, the formal definition of the Markov property is that

    P{π_{t+1} | π_1 π_2 ··· π_t} = P{π_{t+1} | π_t},    t = 1, 2, 3, ....
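The Markov property above can be seen directly in a simulation: to generate each new state of a path, only the current state's row of transition probabilities is consulted, never the earlier history. The following is a minimal sketch; the three-state set and the transition matrix are made-up illustrative values, not from the text.

```python
# Minimal sketch: simulating a path pi = pi_1 pi_2 ... pi_l of a
# finite-state Markov chain. Each next state is drawn using only the
# current state's row of the transition matrix (the Markov property).
import random

states = ["s1", "s2", "s3"]
# transition[i][j] = P{ next state = states[j] | current state = states[i] };
# each row sums to 1. These numbers are illustrative only.
transition = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]

def simulate_path(start, length, rng=random):
    """Generate a path of the given length, beginning in state `start`."""
    path = [start]
    current = states.index(start)
    for _ in range(length - 1):
        # Draw the next state index using only the current row of weights.
        current = rng.choices(range(len(states)), weights=transition[current])[0]
        path.append(states[current])
    return path

random.seed(1)
print(simulate_path("s1", 10))
```

Note that the function keeps the whole path only for reporting; the draw itself uses `current` alone, which is exactly the "no memory beyond the present" behavior described above.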