recall will be improved. Meanwhile, classification can provide a good organization of information that helps users browse and filter it. Many large Websites adopt this kind of information organization. For example, Yahoo maintains its Web catalog structure manually, while Google uses a ranking mechanism that places the pages most relevant to the user first, making browsing more convenient. Deerwester et al. take advantage of linear algebra and perform information filtering and latent semantic indexing (LSI) via singular value decomposition (SVD) (Deerwester, 1990). They project the high-dimensional representation of documents in the vector space model (VSM) into a low-dimensional latent semantic space (LSS). This approach, on the one hand, reduces the scale of the problem and, on the other hand, alleviates data sparseness to some extent. It has achieved good results in many applications, including language modeling, video retrieval, and protein databases.
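The LSI projection described above can be sketched with a truncated SVD. The tiny term-document matrix and the choice of k = 2 latent dimensions below are illustrative assumptions, not data from the text:

```python
# A minimal sketch of latent semantic indexing (LSI) via truncated SVD,
# in the spirit of Deerwester et al. (1990). The toy term-document
# matrix and k = 2 are illustrative assumptions.
import numpy as np

# Rows = terms, columns = documents (raw term counts).
A = np.array([
    [2, 0, 1, 0],   # "web"
    [1, 1, 0, 0],   # "search"
    [0, 2, 1, 0],   # "cluster"
    [0, 0, 1, 2],   # "protein"
], dtype=float)

# Full SVD: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: the rank-k approximation
# projects each document into a k-dimensional latent semantic space.
k = 2
doc_coords = np.diag(s[:k]) @ Vt[:k, :]   # each column: a document in LSS

# Similarity in the latent space (cosine) is less sparse than in VSM.
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(doc_coords.shape)   # (2, 4): 4 documents, 2 latent dimensions
print(round(cosine(doc_coords[:, 0], doc_coords[:, 1]), 3))
```

Comparing documents in the 2-dimensional latent space rather than the original 4-dimensional term space is exactly the scale reduction and sparseness relief the paragraph describes.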
Clustering is one of the main approaches in text mining. Its primary uses include: a) by clustering search results, a Website can present the required Web pages to users in terms of classes, so that users can quickly locate their expected targets; b) generating catalogs automatically; c) analyzing the commonalities among Web pages by clustering them. The typical clustering algorithm is K-means. In addition, newer clustering algorithms, such as the self-organizing map (SOM), clustering with neural networks, and probability-based hierarchical Bayesian clustering (HBC), have also received much study and found many applications. Yet most clustering algorithms are unsupervised and search the solution space somewhat blindly, so the clustering results often lack semantic characterization. Moreover, in high-dimensional cases, selecting a proper distance metric becomes very difficult.
</clustering results often lack>
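The K-means algorithm named above alternates an assignment step and a centroid-update step. A minimal sketch follows; the toy two-cluster data, the farthest-point initialization, and k = 2 are assumptions for illustration:

```python
# A minimal K-means sketch: assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points.
# The toy data and k = 2 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    # Deterministic farthest-point initialization keeps centroids spread out.
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: centroid = mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated "topic" clusters in a toy 2-D feature space.
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(3, 0.3, (10, 2))])
labels, _ = kmeans(X, k=2)
print(labels)
```

The Euclidean distance used here is precisely the choice that becomes problematic in high-dimensional text spaces, as the paragraph notes.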
Web classification is a kind of supervised learning. By analyzing training data, classifiers can predict the class labels of unseen Web pages. Currently, there are many effective algorithms for classifying Web pages, such as the naïve Bayesian method and SVM. Unfortunately, obtaining the large number of labeled training samples necessary for training highly precise classifiers is very costly. Besides, in practice, different classification architectures are often inconsistent, which makes the routine maintenance of Web catalogs difficult. Kamal Nigam et al. proposed a method that trains a classifier from both labeled and unlabeled documents. It requires only a small number of labeled training samples and learns a Bayesian classifier by integrating the knowledge in the unlabeled samples (Nigam, 1998).
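The supervised baseline that Nigam et al. extend is a multinomial naïve Bayes classifier over word counts. A minimal sketch follows; the toy vocabulary, documents, and labels are illustrative assumptions. The semi-supervised variant re-estimates these same parameters with EM, treating the labels of unlabeled documents as hidden variables:

```python
# A minimal multinomial naïve Bayes sketch over word counts; the toy
# vocabulary, documents, and labels are illustrative assumptions.
import numpy as np

# Toy corpus: rows = documents, columns = word counts over the vocabulary.
vocab = ["ball", "score", "vote", "law"]
X = np.array([[3, 2, 0, 0],    # sports
              [2, 3, 0, 1],    # sports
              [0, 0, 3, 2],    # politics
              [0, 1, 2, 3]])   # politics
y = np.array([0, 0, 1, 1])     # class labels: 0 = sports, 1 = politics

n_classes = 2
# Class priors P(c) and Laplace-smoothed word likelihoods P(w | c).
priors = np.array([(y == c).mean() for c in range(n_classes)])
counts = np.array([X[y == c].sum(axis=0) for c in range(n_classes)])
likelihood = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)

def predict(doc_counts):
    # arg max over classes of log P(c) + sum_w n_w * log P(w | c).
    log_post = np.log(priors) + doc_counts @ np.log(likelihood).T
    return int(np.argmax(log_post))

print(predict(np.array([1, 2, 0, 0])))   # a "sports"-like document -> 0
```

Because the parameters are just smoothed count ratios, they are cheap to re-estimate, which is what makes the EM extension over unlabeled documents practical.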
Our basic idea for solving this problem is as follows. Suppose some Web pages D = {d1, d2, …, dn} constitute a description of some latent class variables Z = {z1, z2, …, zk}. First, by introducing a Bayesian latent semantic model, we assign the documents containing latent class variables to the corresponding classes; then we utilize the naïve Bayesian model to classify the documents containing no latent class variables.
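The two-stage idea can be sketched as follows. This is a hedged illustration only: a plain multinomial mixture fitted by EM stands in for the Bayesian latent semantic model, and the data, the confidence threshold, and all parameter choices are assumptions, not the authors' implementation:

```python
# Hedged sketch of the two-stage idea: (1) fit a k-component multinomial
# mixture by EM (a stand-in for the Bayesian latent semantic model) and
# assign high-confidence documents to latent classes; (2) train naïve
# Bayes on those assignments and classify the remaining documents.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[4, 1, 0, 0], [3, 2, 0, 1], [0, 0, 4, 2],
              [1, 0, 2, 4], [2, 2, 1, 1]], dtype=float)  # word counts
k = 2

# --- Stage 1: EM for a k-component multinomial mixture -----------------
pi = np.full(k, 1.0 / k)                       # mixing weights P(z)
theta = rng.dirichlet(np.ones(X.shape[1]), k)  # word distributions P(w | z)
for _ in range(50):
    # E-step: posterior P(z | d) from log P(z) + sum_w n_w log P(w | z).
    log_r = np.log(pi) + X @ np.log(theta).T
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing weights and (smoothed) word distributions.
    pi = r.mean(axis=0)
    theta = (r.T @ X) + 1.0
    theta /= theta.sum(axis=1, keepdims=True)

# Documents with a confident posterior "contain" a latent class variable.
confident = r.max(axis=1) > 0.9
z_hat = r.argmax(axis=1)

# --- Stage 2: naïve Bayes trained on the confident assignments ---------
counts = np.array([X[confident & (z_hat == c)].sum(axis=0) for c in range(k)])
likelihood = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)
priors = (np.bincount(z_hat[confident], minlength=k) + 1) / (confident.sum() + k)
rest = np.log(priors) + X[~confident] @ np.log(likelihood).T
print(rest.argmax(axis=1) if rest.size else "all documents assigned")
```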