Biology Reference
In-Depth Information
of a sequence is the combined number of Cs and Gs in the sequence divided by the total number of nucleotides in the sequence. To define $O/E_{\mathrm{CpG}}$, we note that if dinucleotides in a DNA sequence were formed by random independent choices of two nucleotides, the expected number of CpG dinucleotides in a sequence of length $l$ would be

$$\frac{(\text{number of Cs in the sequence})\,(\text{number of Gs in the sequence})}{l}.$$
The observed CpG would be the actual count of CpG dinucleotides found in the sequence of length $l$. The observed over the expected CpG ratio $O/E$ is defined as the ratio of these two numbers (and, unlike the quantity %C+G, may assume values greater than 1).
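The two quantities defined above can be computed directly from a sequence string. The following is a minimal sketch, not code from the chapter: the function names `gc_fraction` and `cpg_obs_exp` are illustrative, %C+G is returned as a fraction between 0 and 1, and returning 0.0 when the expected count is zero is a convention chosen here.

```python
def gc_fraction(seq):
    """Return %C+G as a fraction: (number of Cs + number of Gs) / length."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq)


def cpg_obs_exp(seq):
    """Return the observed-over-expected CpG ratio for seq."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    # Observed: CpG dinucleotides actually present ("CG" cannot overlap itself,
    # so str.count finds every occurrence).
    observed = seq.count("CG")
    # Expected under independent nucleotide choices: (#C)(#G) / l.
    expected = n_c * n_g / len(seq)
    # Assumption of this sketch: report 0.0 when no CpG could be expected.
    return observed / expected if expected > 0 else 0.0
```

For the toy sequence `CGCGCGAT` (3 Cs, 3 Gs, length 8), the %C+G is 6/8 = 0.75, the expected CpG count is 9/8, and the observed count is 3, so the ratio is 8/3, which illustrates that the ratio can exceed 1.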
In the original study published in 1987 [13], Gardiner-Garden and Frommer defined CGIs in the vertebrate genome as sequences that have: (1) length of at least 200 bp, (2) %C+G ≥ 50%, and (3) $O/E_{\mathrm{CpG}}$ ≥ 0.6. This definition is still commonly used today, but it serves more as a guideline since there is no universal standard for the cutoff values. For instance, Takai and Jones [8] used more stringent criteria to analyze CGIs in human chromosomes 21 and 22: (1) length ≥ 500 bp, (2) %C+G ≥ 55%, and (3) $O/E_{\mathrm{CpG}}$ ≥ 0.65, motivated by reducing the number of CGIs found within Alus (see footnote 2).
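Because the two definitions differ only in their cutoff values, they can be expressed as parameter sets for one check. The sketch below is illustrative: `CgiCriteria`, `meets_criteria`, and the constant names are hypothetical, %C+G is stored as a fraction, and the use of inclusive (greater-than-or-equal) comparisons is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CgiCriteria:
    """Cutoff values for calling a sequence a CpG island (illustrative names)."""
    min_length: int      # minimum length in bp
    min_gc: float        # minimum %C+G, expressed as a fraction
    min_obs_exp: float   # minimum observed/expected CpG ratio


# Cutoffs as quoted in the text; inclusive comparisons are an assumption here.
GARDINER_GARDEN_FROMMER = CgiCriteria(min_length=200, min_gc=0.50, min_obs_exp=0.60)
TAKAI_JONES = CgiCriteria(min_length=500, min_gc=0.55, min_obs_exp=0.65)


def meets_criteria(seq: str, criteria: CgiCriteria) -> bool:
    """Return True if seq satisfies all three cutoffs in criteria."""
    seq = seq.upper()
    length = len(seq)
    if length == 0 or length < criteria.min_length:
        return False
    n_c, n_g = seq.count("C"), seq.count("G")
    gc = (n_c + n_g) / length
    expected = n_c * n_g / length          # expected CpG count
    if expected == 0:
        return False                       # no CpG can be expected
    obs_exp = seq.count("CG") / expected   # observed / expected
    return gc >= criteria.min_gc and obs_exp >= criteria.min_obs_exp
```

A 300 bp artificial sequence such as `"CG" * 150` passes the 1987 cutoffs but fails the Takai and Jones cutoffs on length alone, which shows how the stricter definition prunes short candidates.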
Algorithms for extracting CGIs often utilize a sliding windows approach that has been implemented by many web-based software systems including CpGPlot/CpGReport [14], CpGProd [15], CpGIS [8], and CpGIE [16]. The method calculates the %C+G and $O/E_{\mathrm{CpG}}$ for subsequences of fixed length $l$ that differ from one another only by 1 bp (the new subsequence is offset by 1 bp to the right from the previous one). One can visualize the process as sliding a “window” of length $l$ along the genome. If the subsequence in the window meets the specific cutoff values for %C+G and $O/E_{\mathrm{CpG}}$, it will be included in a (possibly larger) CpG island region. The details of the specific algorithm implemented by CpGIS are shown in Figure 9.4. An animated version of a sliding windows algorithm is available in the CpG Educate suite that has been developed for this chapter and is available at http://inspired.jsu.edu/agarrett/cpg/.
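The general idea of sliding a window 1 bp at a time and merging qualifying windows can be sketched as follows. This is a simplified illustration only; it is not the CpGIS algorithm of Figure 9.4 or the method of any of the tools named above. The function names, the default cutoffs, the inclusive comparisons, and the rule of merging windows that overlap or touch are all assumptions of the sketch.

```python
def window_passes(window, min_gc, min_obs_exp):
    """Check one window against the %C+G and O/E cutoffs."""
    n_c = window.count("C")
    n_g = window.count("G")
    expected = n_c * n_g / len(window)      # expected CpG count
    if expected == 0:
        return False
    gc = (n_c + n_g) / len(window)
    obs_exp = window.count("CG") / expected
    return gc >= min_gc and obs_exp >= min_obs_exp


def sliding_window_regions(seq, window_len=200, min_gc=0.50, min_obs_exp=0.60):
    """Return merged (start, end) regions, 0-based and end-exclusive,
    covered by windows that meet both cutoffs."""
    seq = seq.upper()
    regions = []
    # Each window is offset by 1 bp to the right of the previous one.
    for start in range(len(seq) - window_len + 1):
        if not window_passes(seq[start:start + window_len], min_gc, min_obs_exp):
            continue
        end = start + window_len
        if regions and start <= regions[-1][1]:
            # Overlaps or touches the previous region: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

On a toy input such as `"AT" * 20 + "CG" * 10 + "AT" * 20` with `window_len=10`, the sketch reports the single region `(35, 65)`. The region extends past the CG-rich core (positions 40 to 59) because any window that is at least half C and G still meets the cutoffs, which is one reason practical tools add further trimming or merging rules.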
The project "Investigating Predicted Genes," available online from the volume's website as part of this chapter, uses sliding windows software to search for the presence of CGIs in the vicinity of predicted genes. The existence of a CGI in the area of the sequence where the promoter for the predicted gene should be found would be an additional piece of evidence suggesting that the predicted gene may be, in fact, an actual gene.
Sliding windows algorithms are not based on any specific assumptions of structural mechanisms (mathematical or biological) that can explain the differences in CpG density between the island and non-island regions. As such, they do not utilize any mathematical models, theory, or specialized tools to make the questions of CGI identification more tractable. Sliding windows algorithms often differ from one
2 Alu sequences (named for the restriction endonuclease AluI, which cuts in these sequences) are short repetitive sequences with a relatively high C+G content and $O/E_{\mathrm{CpG}}$ ratio.
 
 