themes, e.g. “politics” and/or “culture”. We handle these cases by labeling them with the most related theme. Under the ML criterion, after some rounds of EM iterations, we finally determine the theme of a test document according to formula (6.51).
The EM algorithm is one of the primary parameter estimation approaches for sparse data. It performs the E step and the M step alternately so as to find the most likely parameters. The general process of the EM algorithm is described below; a minimal code sketch of this loop follows the list:
(1) E step: calculate the expectation based on the current parameters;
(2) M step: find the parameters with maximum likelihood based on the expectation from the E step;
(3) Compute the likelihood with the renewed parameters. If the improvement of the likelihood falls below a predefined threshold or the number of iterations exceeds a predefined value, stop. Otherwise, go to Step (1).
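As a sketch, these three steps can be organized as a generic driver loop. The callables e_step, m_step and log_likelihood below are hypothetical placeholders to be supplied by a concrete model (such as the PLSA updates given later in this section); tol and max_iter play the role of the predefined threshold and iteration limit:

```python
def run_em(params, e_step, m_step, log_likelihood, tol=1e-4, max_iter=100):
    """Generic EM driver following steps (1)-(3) above."""
    prev_ll = float("-inf")
    for _ in range(max_iter):
        expectation = e_step(params)   # (1) expectation under current parameters
        params = m_step(expectation)   # (2) maximum-likelihood re-estimation
        ll = log_likelihood(params)    # (3) likelihood with renewed parameters
        if ll - prev_ll < tol:         # improvement below threshold: stop
            break
        prev_ll = ll
    return params
```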
In our algorithm, we adopt the following two steps in each iteration:
(1) In the E step, we obtain the expectation via the following Bayesian formula:
\[
P(z \mid d, w) \;=\; \frac{p(z)\, p(d \mid z)\, p(w \mid z)}{\sum_{z'} p(z')\, p(d \mid z')\, p(w \mid z')} \qquad (6.52)
\]
In terms of probabilistic semantics, this formula gives the probability that word w in document d is generated by the latent theme variable z.
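A minimal numpy sketch of this E step, assuming the parameters are stored as arrays over K themes, D documents and W words (the array names and shapes are our own convention, not from the text):

```python
import numpy as np

def e_step(p_z, p_dz, p_wz):
    """Posterior P(z|d,w) of formula (6.52), computed for all (d, w) pairs.

    p_z:  (K,)   theme prior p(z)
    p_dz: (K, D) p(d|z), one row per theme
    p_wz: (K, W) p(w|z), one row per theme
    """
    # joint[z, d, w] = p(z) * p(d|z) * p(w|z), the numerator of (6.52)
    joint = p_z[:, None, None] * p_dz[:, :, None] * p_wz[:, None, :]
    # Normalize over the theme axis: the sum over z' in the denominator
    return joint / joint.sum(axis=0, keepdims=True)
```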
(2) In the M step, we use the expectation from the above step to re-estimate the parameters:
\[
p(w \mid z) \;=\; \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{d', w'} n(d', w')\, P(z \mid d', w')} \qquad (6.53a)
\]
\[
p(d \mid z) \;=\; \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{d', w'} n(d', w')\, P(z \mid d', w')} \qquad (6.53b)
\]
\[
p(z) \;=\; \frac{\sum_{d, w} n(d, w)\, P(z \mid d, w)}{\sum_{d, w} n(d, w)} \qquad (6.53c)
\]
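The corresponding M-step sketch, continuing the array conventions of the E-step code above, where n_dw holds the word counts n(d, w) of word w in document d:

```python
import numpy as np

def m_step(n_dw, p_z_dw):
    """Re-estimate p(w|z), p(d|z) and p(z) via (6.53a)-(6.53c).

    n_dw:   (D, W) term-frequency matrix n(d, w)
    p_z_dw: (K, D, W) posterior P(z|d,w) from the E step
    """
    # weighted[z, d, w] = n(d, w) * P(z|d,w): the shared numerator term
    weighted = n_dw[None, :, :] * p_z_dw
    totals = weighted.sum(axis=(1, 2))             # denominator of (6.53a/b)
    p_wz = weighted.sum(axis=1) / totals[:, None]  # (6.53a): sum over d
    p_dz = weighted.sum(axis=2) / totals[:, None]  # (6.53b): sum over w
    p_z = totals / n_dw.sum()                      # (6.53c)
    return p_z, p_dz, p_wz
```

Alternating e_step and m_step until the log-likelihood, Σ_{d,w} n(d,w) log Σ_z p(z) p(d|z) p(w|z), stops improving realizes the loop of steps (1)-(3) described earlier.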
Compared with the SVD used in LSA, the EM algorithm has a linear convergence rate. It is simple and easy to implement, and it converges to a local optimum of the likelihood function. Figure 6.4 shows the relation between the number of iterations and the corresponding maximum likelihood in our experiment.