themes. Accordingly, we proposed the Bayesian latent semantic model for
document generation.
Let the document set be $D = \{d_1, d_2, \ldots, d_n\}$ and the word set be $W = \{w_1, w_2, \ldots, w_m\}$. The generation model for a document $d \in D$ can be expressed as follows:
(1) Choose a document $d$ with probability $P(d)$;
(2) Choose a latent theme $z$, which has the prior knowledge $p(z \mid \theta)$;
(3) Denote the probability that theme $z$ contains document $d$ by $p(z \mid d, \theta)$;
(4) Denote the probability of word $w \in W$ under the theme $z$ by $p(w \mid z, \theta)$.
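The steps above can be read as a sampling procedure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sample_pair`, the dictionary layout, and all toy probabilities are hypothetical. It also assumes that the theme is drawn from p(z | d, θ) once the document is fixed, so the prior p(z | θ) is not used directly.

```python
import random

# Hypothetical toy parameters (not from the text): 2 documents, 2 themes, 3 words.
P_D = {"d1": 0.5, "d2": 0.5}                      # P(d)
P_Z_GIVEN_D = {                                   # p(z | d, theta)
    "d1": {"z1": 0.75, "z2": 0.25},
    "d2": {"z1": 0.25, "z2": 0.75},
}
P_W_GIVEN_Z = {                                   # p(w | z, theta)
    "z1": {"w1": 0.5, "w2": 0.5, "w3": 0.0},
    "z2": {"w1": 0.0, "w2": 0.5, "w3": 0.5},
}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


def sample_pair(p_d, p_z_given_d, p_w_given_z, rng):
    """Draw one observed pair (d, w); the latent theme z is drawn but not returned."""
    d = draw(p_d, rng)                 # step (1): choose a document
    z = draw(p_z_given_d[d], rng)      # steps (2)-(3): choose a theme for that document
    w = draw(p_w_given_z[z], rng)      # step (4): choose a word under that theme
    return d, w


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(sample_pair(P_D, P_Z_GIVEN_D, P_W_GIVEN_Z, rng))
```

Only the pair (d, w) is returned, which mirrors the fact that the theme is never observed.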
After the above process, we obtain the observed pair $(d, w)$. The latent theme $z$ is omitted (summed out), and the joint probability model is generated:
$$p(d, w) = p(d)\, p(w \mid d) \tag{6.44}$$

$$p(w \mid d) = \sum_{z \in Z} p(w \mid z, \theta)\, p(z \mid d, \theta) \tag{6.45}$$
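Formulas (6.44) and (6.45) can be evaluated directly once the three probability tables are known. The sketch below is illustrative only: the table names, the helper functions `word_given_doc` and `joint`, and every number are hypothetical toy values chosen so that each distribution sums to one.

```python
# Hypothetical toy parameters (not from the text): 2 documents, 2 themes, 3 words.
P_D = {"d1": 0.5, "d2": 0.5}                      # p(d)
P_Z_GIVEN_D = {                                   # p(z | d, theta)
    "d1": {"z1": 0.75, "z2": 0.25},
    "d2": {"z1": 0.25, "z2": 0.75},
}
P_W_GIVEN_Z = {                                   # p(w | z, theta)
    "z1": {"w1": 0.5, "w2": 0.5, "w3": 0.0},
    "z2": {"w1": 0.0, "w2": 0.5, "w3": 0.5},
}
WORDS = ["w1", "w2", "w3"]


def word_given_doc(w, d):
    """Formula (6.45): p(w | d) = sum over z of p(w | z) * p(z | d)."""
    return sum(P_W_GIVEN_Z[z][w] * weight for z, weight in P_Z_GIVEN_D[d].items())


def joint(d, w):
    """Formula (6.44): p(d, w) = p(d) * p(w | d)."""
    return P_D[d] * word_given_doc(w, d)


if __name__ == "__main__":
    for d in P_D:
        print(d, {w: word_given_doc(w, d) for w in WORDS})
    # The joint distribution over all (d, w) pairs should sum to 1.
    print(sum(joint(d, w) for d in P_D for w in WORDS))
```

Because each p(z | d, θ) row and each p(w | z, θ) row sums to one, the resulting p(w | d) is again a proper distribution over words.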
This model is a hybrid probabilistic model under the following independence assumptions:
(1) The generation of each observed pair $(d, w)$ is relatively independent; the pairs are related only via the latent themes.
(2) The generation of word $w$ is independent of any concrete document $d$; it depends only on the latent theme variable $z$.
Formula (6.45) indicates that, in a given document $d$, the distribution of word $w$ is a convex combination of the theme-conditional distributions $p(w \mid z, \theta)$. The weight of a theme in the combination is the probability that document $d$ belongs to that theme. Figure 6.3 illustrates the relationships between the factors in the model.
[Figure: three layers of nodes. Documents d_1, d_2, d_3, ..., d_n; latent themes z_1, z_2, ..., z_k; words w_1, w_2, w_3, ..., w_m.]
Figure 6.3. Bayesian latent semantic model.
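As a small numerical illustration of the convex combination in formula (6.45), consider a hypothetical document with only two themes. The weights and word probabilities below are made-up values, not taken from the text.

```latex
% Hypothetical values: p(z_1 | d) = 0.7, p(z_2 | d) = 0.3,
%                      p(w | z_1) = 0.2, p(w | z_2) = 0.6.
\begin{align*}
p(w \mid d) &= p(w \mid z_1, \theta)\, p(z_1 \mid d, \theta)
             + p(w \mid z_2, \theta)\, p(z_2 \mid d, \theta) \\
            &= 0.2 \times 0.7 + 0.6 \times 0.3 \\
            &= 0.14 + 0.18 = 0.32
\end{align*}
```

Since the weights 0.7 and 0.3 are non-negative and sum to one, the result 0.32 lies between the two theme-level probabilities 0.2 and 0.6, as a convex combination must.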
According to the Bayes formula, we substitute formula (6.45) into formula (6.44) and get: