9.4.3 Training: The Baum-Welch Algorithm
For all computations until now, we always assumed the parameters of the HMM are known. Those parameters, however, are usually only initial estimates and, in most cases, we may not even have those estimates to begin with. In this section we will discuss how to estimate the HMM parameters from the data. This process is called training (or sometimes learning). The algorithm presented here was first introduced in [31] (see also [32]). We begin with an observed sequence $x$ or a set of observed sequences $x_1, x_2, \ldots, x_m$ for which we want to adjust the HMM model parameters to ensure the best possible fit. Once a set of parameters is determined for those sequences, we can apply the Viterbi algorithm or use the posterior probabilities to decode other similar sequences. The sequences $x_1, x_2, \ldots, x_m$ are called training sequences and we say that we will train the model to best fit those sequences.
More formally, this means that given the data $x_1, x_2, \ldots, x_m$ and the HMM, we need to find the values of the model parameters (the transition probabilities $a_{jk}$ and the emission probabilities $e_k(b)$ for $j, k \in Q$ and $b \in M$) that maximize the probability of the data $P(x)$. If we use $\theta$ to refer to the whole set of parameters of the model, in this section we will sometimes write $P(x) = P(x \mid \theta) = P(x \mid \theta, \mathrm{HMM})$ to emphasize that the likelihood of the data depends on the set of parameters $\theta$ and on the model. The goal is to find $\theta = \operatorname{argmax}_{\theta} \{ P(x \mid \theta) \}$.
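For a single sequence $x$, the likelihood $P(x \mid \theta)$ can be evaluated with the forward algorithm used in the earlier sections. The following minimal Python sketch illustrates this; the dictionary-based parameter layout and the initial distribution `pi` are assumptions made for the example, not notation from the text (the model may instead start from a silent begin state).

```python
def likelihood(x, states, a, e, pi):
    """P(x | theta) via the forward algorithm.

    a[j][k] : transition probability from state j to state k
    e[k][b] : probability that state k emits symbol b
    pi[k]   : assumed initial distribution over states
    """
    # f[k] = P(x_1 .. x_i, state at position i is k)
    f = {k: pi[k] * e[k][x[0]] for k in states}
    for symbol in x[1:]:
        f = {k: e[k][symbol] * sum(f[j] * a[j][k] for j in states)
             for k in states}
    return sum(f.values())  # P(x | theta), summed over the final state
```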
If the training sequences are annotated and we know the exact paths of the process emitting the training sequences, the estimates for the model parameters will be computed by determining the frequencies of each transition and emission. Assume that $A_{jk}$ is the number of transitions from $j$ to $k$ in the set of training sequences and $E_k(b)$ is the number of times the state $k \in Q$ emitted the symbol $b \in M$. Then the frequencies
$$a_{jk} = \frac{A_{jk}}{\sum_{r \in Q} A_{jr}} \quad \text{and} \quad e_k(b) = \frac{E_k(b)}{\sum_{c \in M} E_k(c)}, \qquad \text{for } j, k \in Q,\; b \in M, \qquad (9.8)$$
are maximum likelihood estimates for the HMM [23].
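When the state paths are known, Eq. (9.8) amounts to counting and normalizing. A minimal Python sketch under the same assumed dictionary layout (the function and variable names are illustrative only):

```python
def ml_estimates(training_pairs, states, alphabet):
    """Maximum likelihood estimates of Eq. (9.8) from annotated sequences.

    training_pairs : list of (x, path) where x is a symbol sequence and
                     path is the known state sequence of the same length.
    Assumes every state emits and leaves at least once in the training
    data, so no denominator below is zero.
    """
    A = {j: {k: 0 for k in states} for j in states}    # transition counts A_jk
    E = {k: {b: 0 for b in alphabet} for k in states}  # emission counts E_k(b)
    for x, path in training_pairs:
        for i, (symbol, state) in enumerate(zip(x, path)):
            E[state][symbol] += 1
            if i + 1 < len(path):
                A[state][path[i + 1]] += 1
    # Normalize the counts row by row, exactly as in Eq. (9.8)
    a = {j: {k: A[j][k] / sum(A[j].values()) for k in states} for j in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e
```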
When the paths that generate the training data are unknown, we can no longer determine the counts $A_{jk}$ and $E_k(b)$, but they can be replaced with the expected counts for each transition/emission in the HMM. Obtaining the maximum likelihood estimates in this case can be done by an iterative process known as the Baum-Welch algorithm [31]. This method is no longer guaranteed to find a global maximum for the probability of the data, but the iterative steps will converge to a local maximum. The Baum-Welch algorithm is a special case of a more general class of methods known as Expectation Maximization algorithms, or EM algorithms. Below we will outline the computational aspect of the Baum-Welch method but will omit the theoretical proofs of convergence to a maximum. Those proofs, together with the general theory of EM algorithms, can be found in [33]. In the description of the Baum-Welch algorithm below we will assume that the HMM is being trained on a single data sequence $x$.
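To give a concrete picture of how expected counts take the place of the actual counts in Eq. (9.8), here is a compact Python sketch of a single re-estimation step on one sequence. It uses the forward and backward recursions with the same assumed dictionary layout and initial distribution `pi` as above, and it omits the scaling needed to avoid numerical underflow on long sequences.

```python
def baum_welch_step(x, states, alphabet, a, e, pi):
    """One Baum-Welch re-estimation step on a single sequence x (sketch).

    Replaces the counts A_jk and E_k(b) of Eq. (9.8) with their expected
    values under the current parameters, then renormalizes.
    """
    L = len(x)
    # Forward: f[i][k] = P(x_1 .. x_{i+1}, state k at position i+1)
    f = [{k: pi[k] * e[k][x[0]] for k in states}]
    for i in range(1, L):
        f.append({k: e[k][x[i]] * sum(f[i - 1][j] * a[j][k] for j in states)
                  for k in states})
    px = sum(f[L - 1].values())                 # P(x | current parameters)
    # Backward: bw[i][k] = P(x_{i+2} .. x_L | state k at position i+1)
    bw = [None] * L
    bw[L - 1] = {k: 1.0 for k in states}
    for i in range(L - 2, -1, -1):
        bw[i] = {j: sum(a[j][k] * e[k][x[i + 1]] * bw[i + 1][k] for k in states)
                 for j in states}
    # Expected transition and emission counts
    A = {j: {k: sum(f[i][j] * a[j][k] * e[k][x[i + 1]] * bw[i + 1][k]
                    for i in range(L - 1)) / px
             for k in states} for j in states}
    E = {k: {b: sum(f[i][k] * bw[i][k] for i in range(L) if x[i] == b) / px
             for b in alphabet} for k in states}
    # Renormalize as in Eq. (9.8), now with expected counts
    a_new = {j: {k: A[j][k] / sum(A[j].values()) for k in states} for j in states}
    e_new = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a_new, e_new, px
```

Iterating this step until $P(x \mid \theta)$ stops improving yields the local maximum described above.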