another not only in the choice of the threshold values for the length, %C+G, and
O/E CpG ratio for a sequence to be recognized as a CGI but also in whether islands
that are separated by small gaps would be merged if they still meet the cutoff criteria.
In addition, some CGI searcher sites use a modified criterion for CGIs, requiring that
the %C+G and O/E CpG thresholds be met over an average of several windows
(e.g., EMBOSS uses an average of 10 windows). For the same DNA sequences, these
differences in the algorithms would generally lead to different regions being identified
as CGIs. More importantly, it has been shown that the sliding-window method does
not guarantee an exhaustive search and that it may fail to identify all regions of the
genome that meet the established criteria [17]. The use of alternative methods for CGI
identification, such as Hidden Markov Models (HMMs) [18] and clustering methods
[19-22], would therefore be preferable.
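To make the sliding-window criteria concrete, the following minimal Python sketch scans a sequence and reports the window start positions that meet the thresholds. It is an illustration under stated assumptions, not the algorithm of any particular CGI search tool. The function names are hypothetical, the default thresholds (window length 200, %C+G of at least 50%, O/E CpG of at least 0.6) are commonly quoted values that are assumed here rather than taken from this text, and the O/E CpG ratio is computed with the commonly used formula (number of CpG x window length) / (number of C x number of G). Merging of nearby islands and averaging over several windows are deliberately left out.

```python
def gc_fraction(window: str) -> float:
    """Fraction of C and G nucleotides in the window (the %C+G value as a fraction)."""
    window = window.upper()
    return (window.count("C") + window.count("G")) / len(window)


def cpg_obs_exp(window: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G); 0.0 if C or G is absent."""
    window = window.upper()
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)


def sliding_window_hits(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return start positions of windows meeting both thresholds.

    The default thresholds are assumptions for illustration only.
    """
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_fraction(w) >= min_gc and cpg_obs_exp(w) >= min_oe:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Toy sequence: a CpG-rich stretch followed by an AT-only stretch.
    toy = "CG" * 100 + "AT" * 100
    print(sliding_window_hits(toy)[:5])
```

Changing `window`, `step`, or the two thresholds changes which positions are reported, which is one reason different tools can disagree on the same sequence.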
For the rest of this chapter we consider the HMM approach to locating CGIs. Such
models are based on assumptions about the distribution of the nucleotides and
dinucleotides in the genome and provide a convenient mathematical framework within
which the question of locating the island regions translates into well-understood
mathematical problems. We begin with some examples.
The frequencies in Tables 9.1 and 9.2 present an example of nucleotide frequencies
obtained from a sequence of annotated human DNA of about 60,000 nucleotides
with known locations and lengths of the islands [23]. There are notable differences
in the distributions, in agreement with the expectations that island regions would
have elevated %C+G content and O/E CpG ratio and higher frequencies of the CpG
dinucleotide.
For unannotated sequences those frequencies would be unknown and we would want
Table 9.1 Sample dinucleotide frequencies (from [23]). The first row represents
the frequencies of the transitions from A to A, C, T, and G in island and
non-island regions, and similarly for the other rows. Note that G is a lot more
likely to follow C in island regions. The transition frequencies have been
computed from annotated DNA as follows. If $a^{+}_{ij}$ stands for the transition
frequency (transition probability) from letter $i$ to letter $j$ in the island
region, where $i, j \in Q$, $Q = \{A, C, T, G\}$, then $a^{+}_{ij}$ is computed as
the ratio $a^{+}_{ij} = c^{+}_{ij} / \sum_{k \in Q} c^{+}_{ik}$, where $c^{+}_{ij}$
is the number of times the letter $i$ is followed by the letter $j$ in the
annotated island regions. The transition probabilities $a^{-}_{ij}$ for the
non-island regions are computed in the same way.
         Island ("+")                     Non-Island ("-")
         Dinucleotide Frequencies         Dinucleotide Frequencies
       A      C      T      G           A      C      T      G
A    0.180  0.274  0.120  0.426       0.300  0.205  0.210  0.285
C    0.171  0.368  0.188  0.274       0.322  0.298  0.302  0.078
T    0.161  0.339  0.125  0.375       0.248  0.246  0.208  0.298
G    0.079  0.355  0.182  0.384       0.177  0.239  0.292  0.292
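The counting rule in the caption of Table 9.1 (each transition frequency is the count of letter i followed by letter j, divided by the total count of transitions out of i) can be sketched in Python as follows. This is a minimal illustration, not the procedure used to produce the table. The function name `transition_probabilities` and the toy input are hypothetical, and skipping pairs that contain a letter outside {A, C, T, G} and returning 0.0 for a letter with no outgoing transitions are assumptions made for the sketch.

```python
from collections import Counter

ALPHABET = "ACTG"  # same column order as Table 9.1


def transition_probabilities(regions):
    """Estimate a_ij = c_ij / sum_k c_ik from a list of annotated regions.

    `regions` holds the sequences of one class only (all island regions,
    or all non-island regions). Pairs are counted within each region and
    never across region boundaries.
    """
    counts = Counter()
    for region in regions:
        region = region.upper()
        for i, j in zip(region, region[1:]):
            # Assumption: pairs with ambiguous letters (e.g. N) are skipped.
            if i in ALPHABET and j in ALPHABET:
                counts[i, j] += 1

    probs = {}
    for i in ALPHABET:
        row_total = sum(counts[i, k] for k in ALPHABET)
        for j in ALPHABET:
            # Assumption: a letter with no outgoing transitions gets 0.0.
            probs[i, j] = counts[i, j] / row_total if row_total else 0.0
    return probs


if __name__ == "__main__":
    toy_island_regions = ["ACGCGT"]  # hypothetical toy data
    probs = transition_probabilities(toy_island_regions)
    print(probs["C", "G"], probs["G", "C"], probs["G", "T"])
```

Running the same function separately on the island regions and on the non-island regions would give the two halves of a table like Table 9.1. In the table itself, the C to G entry is 0.274 for islands against 0.078 for non-islands, roughly a 3.5-fold difference.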
 