gained good classification results, but they still need a certain number of labeled
documents (Nigam, 1998). Web clustering merges related web pages into one
cluster according to some similarity criterion. When dealing with high-dimensional,
massive data, conventional clustering methods cannot achieve satisfactory
effectiveness and efficiency. There are two reasons: on the one hand, unsupervised
search in the solution space is to some extent blind; on the other hand, common
similarity metrics, e.g. Euclidean distance, do not work well in high-dimensional
space, and it is hard to find a proper similarity metric in this situation. Considering
the characteristics of supervised and unsupervised learning, we proposed a
semi-supervised learning algorithm. Under the framework of the Bayesian latent
semantic model, we can classify documents into different classes with some
user-provided latent class variables. In this process, no labeled training documents
are required.
The general model is described as follows: given a document set
$D = \{d_1, d_2, \ldots, d_n\}$ and its word set $W = \{w_1, w_2, \ldots, w_m\}$,
and a group of class variables $Z = \{z_1, z_2, \ldots, z_k\}$ with prior
information $\theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$, we seek a division
$D_j$ ($j \in \{1, \ldots, k\}$) of $D$ such that:

$$\bigcup_{j=1}^{k} D_j = D, \qquad D_i \cap D_j = \varnothing \quad (i \neq j)$$
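The two conditions above say that the sets $D_j$ form a partition of $D$: together they cover every document, and no document belongs to two of them. As a minimal sketch, assuming documents are represented by hashable identifiers (the helper name `is_partition` is hypothetical and not part of the original text), the conditions can be checked like this:

```python
from typing import Hashable, Iterable, List, Set


def is_partition(D: Iterable[Hashable], parts: List[Set[Hashable]]) -> bool:
    """Return True if `parts` covers D exactly and the parts are pairwise disjoint."""
    universe = set(D)
    covered = set().union(*parts)
    # If the sizes of the parts add up to the size of their union,
    # no element appears in more than one part.
    total = sum(len(p) for p in parts)
    return covered == universe and total == len(covered)
```

A candidate division that passes this check satisfies both the covering condition and the disjointness condition; it says nothing about whether the division is a good classification.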
Firstly, we divide $D$ into two sets, $D = D_L \cup D_U$, where

$$D_L = \{ d \mid \exists j,\ z_j \in d,\ j \in [1, k] \}, \qquad D_U = \{ d \mid \forall j,\ z_j \notin d,\ j \in [1, k] \}$$
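The split of $D$ can be illustrated with a short sketch. It assumes that each document is a list of word tokens and that each class variable $z_j$ is a theme word supplied by the user, so that "$z_j \in d$" is read as "the theme word occurs in the document". That reading, the function name `split_by_themes`, and the data layout are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple


def split_by_themes(
    docs: Dict[str, Sequence[str]], themes: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split document ids into D_L (contains some theme word) and D_U (contains none)."""
    theme_set = set(themes)
    d_l: List[str] = []
    d_u: List[str] = []
    for doc_id, tokens in docs.items():
        # A document goes to D_L if at least one class variable occurs in it.
        if theme_set.intersection(tokens):
            d_l.append(doc_id)
        else:
            d_u.append(doc_id)
    return d_l, d_u
```

By construction the two lists are disjoint and together contain every document id, so $D = D_L \cup D_U$ holds.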
In our algorithm, the classification process includes two stages:
Stage 1 Utilize the Bayesian latent semantic model, with parameters estimated
by the EM algorithm, to label the documents in $D_L$:

$$l(d) = z_j = \max_i \{ p(d \mid z_i) \} \qquad (6.51)$$
Stage 2 Train a naïve Bayesian classifier with the labeled documents in $D_L$, and
label the documents in $D_U$ with this classifier. Then update the parameters of the
Bayesian latent semantic model with the EM algorithm.
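The two stages can be sketched roughly as follows. This is an illustration under several assumptions, not the authors' implementation: the word distributions `p_w_given_z` are taken as already estimated (the EM step is not shown), a document's likelihood is approximated as a product of word probabilities, the naïve Bayes classifier is a multinomial one with add-one smoothing, and the final EM update of Stage 2 is omitted. All function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def log_p_doc(tokens, p_w_given_zi, floor=1e-9):
    """log p(d | z_i) under a bag-of-words assumption; unseen words get `floor`."""
    return sum(math.log(p_w_given_zi.get(w, floor)) for w in tokens)


def stage1_label(docs_l, p_w_given_z):
    """Stage 1: give each document in D_L the class z that maximizes p(d | z)."""
    return {
        doc_id: max(p_w_given_z, key=lambda z: log_p_doc(tokens, p_w_given_z[z]))
        for doc_id, tokens in docs_l.items()
    }


def train_naive_bayes(docs_l, labels):
    """Stage 2a: collect class priors and per-class word counts from labeled D_L."""
    class_counts = Counter(labels.values())
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc_id, tokens in docs_l.items():
        word_counts[labels[doc_id]].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs_l)
    log_prior = {z: math.log(c / n_docs) for z, c in class_counts.items()}
    return log_prior, word_counts, vocab


def stage2_label(docs_u, model):
    """Stage 2b: label documents in D_U with the trained classifier."""
    log_prior, word_counts, vocab = model
    v = len(vocab)

    def score(z, tokens):
        total = sum(word_counts[z].values())
        # Add-one smoothing; words outside the training vocabulary are ignored.
        return log_prior[z] + sum(
            math.log((word_counts[z][w] + 1) / (total + v))
            for w in tokens
            if w in vocab
        )

    return {
        doc_id: max(log_prior, key=lambda z: score(z, tokens))
        for doc_id, tokens in docs_u.items()
    }
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied. In a full implementation, the labels obtained for $D_U$ would then feed the EM update of the latent semantic model mentioned in Stage 2.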
6.7.2 Label documents with latent classification themes
Ideally, no document would contain more than one latent class theme. In this
case, we could easily label a document with its latent theme. In practice, however,
this ideal is hard to achieve. On the one hand, it is difficult to find such
latent themes; on the other hand, there may be multiple themes in one document.
For example, a document labeled with “economics” may contain words of other