Biology Reference
In-Depth Information
Table 9.5 to obtain a HMMwith transition and emission probabilities such as those in
Table 9.4 .
The parameters in Table 9.5 present a very special case among all possible HMMs
with Q
and we have
included it here since it provides a natural way to include the information from
Table 9.1 . Ultimately, though, the parameters of the HMM would need to be esti-
mated from the DNA data, and it would be more appropriate to consider the model
in its full generality. Thus, any transition matrix P under the Transitions heading, any
emission matrix E under the Emissions heading, and any initial distribution could be
used as model parameters. In [ 27 ], the authors consider such a general model and
estimate the parameters from a set of 1000 DNA sequences.
={
A
+ ,
A
,
C
+ ,
C
,
T
+ ,
T
,
G
+ ,
G
}
and M
={
A
,
C
,
T
,
G
}
9.4 THREE CANONICAL PROBLEMS FOR HMMs
WITH APPLICATIONS TO CGI IDENTIFICATION
We now turn to the mathematical solutions of the questions raised above and describe
how they can be solved in a computationally efficient way. After discussing each
problem, we provide examples and exercises with connections to CGI identification.
We begin with restating the questions in the context of HMMs:
1. Decoding Problem : Given an observed path x
=
x 1 x 2 x 3 ···
x l generated
by a HMM with known parameters, what
is the most
likely hidden path
π
= π 1 π 2 π 3 ··· π l
to emit x ? In other words, how do we find
π max
=
? 6 From Example 9.3 we know how to compute P
argmax
P
|
x
)
(
x
,π)
for
π
any hidden path
π
and it can be shown (see Exercise 9.6 below) that
π max =
argmax
π
P
|
x
) =
argmax
π
P
(
x
,π).
Since there are finitely many hidden paths, one may be tempted to answer the
question by computing P
, compare their values, and pick
the largest. Such a “brute force” approach is not practical, however, since the
number of all hidden paths grows exponentially with the length of the paths l .
The number
(
x
,π)
for all paths
π
l of all such paths is astronomical for large values of l , making
the task impossible even for the fastest computers. 7
2. Evaluation Problem : Given an observed path x , what is the probability of this path
P
|
Q
|
(
x
)
? Mathematically, this probability can be expressed as
P
(
x
) =
P
(
x
,π),
(9.1)
π
6 The notation is for the argument of themaximum.We need the path(s)
ismaximal.
7 For an observed sequence of just 90 nucleotides, the number of paths is a bit over 10 80 ,whichis
the estimated number of atoms in the observable universe. The brute force approach would therefore
require years of computing time even if large computing resources are applied. In principle, we would
be interested in sequences of tens of thousands of nucleotides.
π
for which P
(
x
,π)
 
Search WWH ::




Custom Search