Identifying CpG Islands: Sliding Window and Hidden Markov Model Approaches - Mathematical Concepts and Methods in Modern Biology


another not only in the choice of the threshold values for the length, %C+G, and

O/E CpG ratio for a sequence to be recognized as a CGI but also in whether islands

that are separated by small gaps would be merged if they still meet the cutoff criteria.

In addition, some CGI searcher sites use a modified criterion for CGIs, requiring that

the %C+G and O/E CpG thresholds are met over an average of several windows

(e.g., EMBOSS uses an average of 10 windows). For the same DNA sequences, these

differences in the algorithms would generally lead to different regions being identified as

CGIs. More importantly, it has been shown that the sliding-window method does

not guarantee an exhaustive search and that it may fail to identify all regions of the

genome that meet the established criteria [17]. The use of alternative methods for CGI

identification, such as Hidden Markov Models (HMMs) [18] and clustering methods

[19-22], would therefore be preferable.
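The window criteria discussed above can be illustrated with a minimal sketch. It assumes a commonly used definition of the O/E CpG ratio, (number of CpG × window length) / (number of C × number of G), and takes a 200-nucleotide window with thresholds of 50% C+G and 0.6 O/E as default values; these defaults and the function names `window_stats` and `candidate_windows` are illustrative assumptions, not the settings of any particular CGI searcher. Merging of neighboring windows and averaging over several windows are deliberately left out.

```python
def window_stats(window):
    """Return (%C+G, O/E CpG ratio) for one window of DNA."""
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact CpG count.
    cpg = w.count("CG")
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    # Observed CpG count divided by the count expected if C and G
    # occurred independently (assumed definition of the O/E ratio).
    oe = cpg * n / (c * g) if c and g else 0.0
    return gc_percent, oe


def candidate_windows(seq, size=200, step=1, min_gc=50.0, min_oe=0.6):
    """Yield start positions of windows that meet both thresholds.

    The default thresholds are assumed values for illustration only.
    """
    for start in range(0, len(seq) - size + 1, step):
        gc, oe = window_stats(seq[start:start + size])
        if gc >= min_gc and oe >= min_oe:
            yield start
```

Changing `size`, `step`, or either threshold changes which start positions are reported, which is one way to see why different tools can disagree on the same sequence.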

For the rest of this chapter we consider the HMM approach to locating CGIs. Such

models are based on assumptions about the distribution of the nucleotides and

dinucleotides in the genome and provide a convenient mathematical framework within

which the question of locating the island regions translates into well-understood

mathematical problems. We begin with some examples.

Tables 9.1 and 9.2 present an example of nucleotide frequencies

obtained from a sequence of annotated human DNA of about 60,000 nucleotides

with known locations and lengths of the islands [23]. There are notable differences

in the distributions, in agreement with the expectation that island regions would

have elevated %C+G content and O/E CpG ratio, and higher frequencies of the CpG dinucleotide.

For unannotated sequences those frequencies would be unknown and we would want

Table 9.1 Sample dinucleotide frequencies (from [23]). The first row represents the frequencies of the transitions from A to A, C, T, and G in island and non-island regions, and similarly for the other rows. Note that G is a lot more likely to follow C in island regions. The transition frequencies have been computed from annotated DNA as follows. If $a^{+}_{ij}$ stands for the transition frequency (transition probability) from letter $i$ to letter $j$ in the island region, where $i, j \in Q$, $Q = \{A, C, T, G\}$, then $a^{+}_{ij}$ is computed as the ratio $a^{+}_{ij} = c^{+}_{ij} / \sum_{k \in Q} c^{+}_{ik}$, where $c^{+}_{ij}$ is the number of times the letter $i$ is followed by the letter $j$ in the annotated island regions. The transition probabilities $a^{-}_{ij}$ for the non-island regions are computed in the same way.

        Island (“+”)                       Non-Island (“−”)
        Dinucleotide Frequencies           Dinucleotide Frequencies
        A      C      T      G             A      C      T      G
A       0.180  0.274  0.120  0.426         0.300  0.205  0.210  0.285
C       0.171  0.368  0.188  0.274         0.322  0.298  0.302  0.078
G       0.161  0.339  0.125  0.375         0.248  0.246  0.208  0.298
T       0.079  0.355  0.182  0.384         0.177  0.239  0.292  0.292
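The estimation rule in the caption of Table 9.1, where each transition probability is the count of a dinucleotide divided by the total count of dinucleotides starting with the same letter, can be sketched as follows. The function name `transition_probabilities` and the representation of annotated regions as a list of strings are assumptions made for illustration; the same function would be applied once to the island regions and once to the non-island regions.

```python
from collections import Counter

ALPHABET = "ACTG"  # Q = {A, C, T, G}, in the order used in the caption


def transition_probabilities(regions):
    """Estimate a_ij = c_ij / sum_k c_ik from regions of a single type.

    `regions` is a list of DNA strings, all island or all non-island.
    Returns a dict keyed by (i, j) letter pairs.
    """
    counts = Counter()
    for region in regions:
        r = region.upper()
        # Count each adjacent pair (i, j); pairs never span two regions.
        for i, j in zip(r, r[1:]):
            if i in ALPHABET and j in ALPHABET:
                counts[i, j] += 1
    probs = {}
    for i in ALPHABET:
        row_total = sum(counts[i, k] for k in ALPHABET)
        for j in ALPHABET:
            # A letter never seen as a first letter gets an all-zero row.
            probs[i, j] = counts[i, j] / row_total if row_total else 0.0
    return probs
```

For example, `transition_probabilities(["ACGCG", "CCG"])["C", "G"]` is 0.75, because C is followed by G three times and by C once. With real annotated data, every row that has at least one observation sums to 1, as the rows of Table 9.1 do.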
