Probabilistic Reasoning - Advanced Artificial Intelligence

gained good classification results, but they still need a certain number of labeled documents (Nigam, 1998). Web clustering merges related web pages into one cluster according to some similarity criterion. When dealing with high-dimensional, massive data, conventional clustering methods cannot achieve satisfactory effectiveness and efficiency. There are two reasons: on the one hand, unsupervised search in the solution space is to some extent blind; on the other hand, common similarity metrics, e.g. Euclidean distance, do not work well in high-dimensional space, and it is hard to find a proper similarity metric in this situation. Considering the characteristics of supervised and unsupervised learning, we proposed a semi-supervised learning algorithm. Under the framework of the Bayesian latent semantic model, we can classify documents into different classes using some user-provided latent class variables. In this process, no labeled training documents are required.

The general model is described as follows: given a document set $D = \{d_1, d_2, \ldots, d_n\}$ and its word set $W = \{w_1, w_2, \ldots, w_m\}$, and a group of class variables $Z = \{z_1, z_2, \ldots, z_k\}$ with prior information $\theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$, try to seek a division $D_j$ ($j \in (1, \ldots, k)$) of $D$, so that:

$$\bigcup_{j=1}^{k} D_j = D, \qquad D_i \cap D_j = \emptyset \quad (i \neq j)$$

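Firstly, we divide $D$ into two sets, $D = D_L \cup D_U$, where

$$D_L = \{d \mid \exists j,\ z_j \in d,\ j \in [1 \ldots k]\}$$

$$D_U = \{d \mid \forall j,\ z_j \notin d,\ j \in [1 \ldots k]\}$$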
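The split into a set of documents that mention at least one user-provided class variable and a set of documents that mention none can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_by_theme_words`, the whitespace tokenization, and the sample documents are hypothetical, and "a class variable occurs in a document" is read here as "the theme word appears among the document's tokens".

```python
def split_by_theme_words(documents, theme_words):
    """Split documents into D_L (contain a theme word) and D_U (contain none).

    Assumption: a class variable z_j "occurs in" a document when the
    corresponding theme word is one of its lowercase whitespace tokens.
    """
    themes = set(theme_words)
    d_l, d_u = [], []
    for doc in documents:
        tokens = set(doc.lower().split())
        if tokens & themes:      # at least one theme word is present
            d_l.append(doc)
        else:                    # no theme word is present
            d_u.append(doc)
    return d_l, d_u


# Hypothetical toy corpus and theme words, for illustration only.
docs = [
    "economics growth and trade",
    "sports match result",
    "politics election debate",
    "weather is sunny today",
]
d_l, d_u = split_by_theme_words(docs, ["economics", "politics", "sports"])
```

Every document lands in exactly one of the two lists, so together they cover the whole collection without overlap.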

In our algorithm, the classification process includes two stages:

Stage 1: Utilize the Bayesian latent semantic model, with parameters estimated by the EM algorithm, to label the documents in $D_L$:

$$l(d) = \arg\max_{z_j} \{P(d \mid z_j)\} \qquad (6.51)$$

Stage 2: Train a naïve Bayesian classifier with the labeled documents in $D_L$, and label the documents in $D_U$ with this classifier. Then update the parameters of the Bayesian latent semantic model with the EM algorithm.
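The two stages can be illustrated with a small Python sketch. It is not the authors' implementation: the per-theme scores in `theme_scores` are made-up placeholders for the quantities that the EM-trained latent semantic model would supply, the multinomial naïve Bayes with Laplace smoothing (`alpha`) is one common variant chosen here as an assumption, and the final EM update of the latent semantic model is omitted. All function and variable names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def label_by_best_theme(theme_scores):
    """Stage 1: assign each document the theme with the highest score."""
    return {doc: max(scores, key=scores.get) for doc, scores in theme_scores.items()}


def train_naive_bayes(labeled_docs, alpha=1.0):
    """Train multinomial naive Bayes on (tokens, label) pairs (Laplace smoothing)."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = len(labeled_docs)
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(tokens, model):
    """Stage 2: return the class with the highest posterior log-score."""
    log_prior, log_likelihood, vocab = model

    def score(c):
        # Words never seen in training are ignored.
        return log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)

    return max(log_prior, key=score)


# Illustrative D_L documents and placeholder scores (not real EM output).
d_l_tokens = {
    "d1": "stock market trade".split(),
    "d2": "market price inflation".split(),
    "d3": "election vote parliament".split(),
    "d4": "parliament vote law".split(),
}
theme_scores = {
    "d1": {"economics": 0.7, "politics": 0.1},
    "d2": {"economics": 0.6, "politics": 0.2},
    "d3": {"economics": 0.1, "politics": 0.8},
    "d4": {"economics": 0.2, "politics": 0.5},
}

stage1_labels = label_by_best_theme(theme_scores)
labeled = [(d_l_tokens[doc], label) for doc, label in stage1_labels.items()]
model = train_naive_bayes(labeled)

# Illustrative D_U documents: they contain no theme word themselves.
d_u_tokens = {
    "u1": "market inflation rises".split(),
    "u2": "new law passes vote".split(),
}
stage2_labels = {doc: classify(tokens, model) for doc, tokens in d_u_tokens.items()}
```

In this toy run the classifier learns word statistics only from the documents labeled in Stage 1, which is why no manually labeled training documents are needed.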

6.7.2 Label documents with latent classification themes

Ideally, no document contains more than one latent class theme. In this case, we can easily label a document with its latent theme. In practice, however, this ideal is hard to achieve. On the one hand, it is difficult to find such latent themes; on the other hand, there may be multiple themes in one document. For example, a document labeled with “economics” may contain words of other
