Biology Reference
In-Depth Information
With this background, we can now ask the following questions: Given an observed
sequence x
π =
π 1 π 2 π 3 ··· π l that maximizes the probability of observing x ? Can we estimate the
parameters of the HMM (the transition matrix P and the emission probabilities) from
the observed sequence x?
The main reason for considering these gambling examples is that the problem of
locating CGIs is in many ways mathematically analogous and may be stated similarly.
A DNA sequence x composed of the symbols A, C, T , and G may be viewed as gen-
erated by a mechanism switching between two hidden states Q
=
x 1 x 2 x 3 ···
x l , how do we determine the hidden sequence
, analogous
to the honest and dishonest dice. One state is that of CGIs (the “+” region), the other
is the non-island region (the “
={+ , −}
” region). The states A, C, T , and G are the same
for both regions, but the transition matrices and/or the emission probabilities will be
different. 5 We cannot observe the state-to-state transitions
π = π 1 π 2 π 3 ··· π l of the
process directly but we observe the sequences x
=
x 1 x 2 x 3 ···
x l emitted by the process
where each x t is a nucleotide from the set M
. The questions above for
the casino games are now immediately translated into the following questions: Given a
long DNA sequence x
={
A
,
C
,
T
,
G
}
TCAC , how do we determine which hidden
path is the most likely to have emitted this sequence? Can we estimate the transition
and the emission probabilities of the HMM from the observed sequence? To examine
these questions in detail, we first need to introduce the general notation for HMMs.
In general, a HMM consists of a hidden Markov chain with a finite state space
Q and a finite set of emission symbols M . The state of the hidden process at time is
denoted by
=
ACTGTC
···
π t . The transition probabilities are a ij =
P
t + 1
=
j
| π t
=
i
)
, and the
, i Q p i
initial distribution is p i
=
a 0 i
=
P
1 =
i
),
i
Q
=
1. The process emits
a symbol k
M from each of the hidden states j
Q that it visits. The emission
probabilities are denoted by e j (
k
) :
e j (
k
) =
P
(
x t
=
k
| π t
=
j
),
j
Q
,
k
M ,
where (since each of the hidden states emits exactly one symbol) k M e j (
1.
The set of transition, emission, and initial probabilities forms the set of parameters
of the HMM.
A convenient way to summarize the parameters of a HMM is to organize them in
a table where the transition matrix, emission probabilities, and the initial distribution
are given in the columns and where the rows are labeled by the hidden states. Table 9.3
contains the parameters of the HMM from Example 9.3 .
Exercise 9.4. Give the table for the parameters of the HMM from Example 9.4 .
k
) =
9.3.3 Viewing DNA Sequences As Outputs of a HMM
Tables 9.1 and 9.2 demonstrate two different ways of viewing the quantitative differ-
ences between the “
+
” and the “
” regions in the genome, each one of which can
5 For example, within the “
” regions, the transition probabilities could be similar to those
in Table 9.1 (with the modification that there should also be small nonzero probabilities of switching
between the “
+
”andthe“
+
”andthe“
” regions).
 
Search WWH ::




Custom Search