Web Mining by Automatically Organizing Web Pages into Categories - Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications

Information Technology Reference

In-Depth Information

similar to one another than data in other clusters (Berkhin, 2002; Everitt et al., 2001; Jain & Dubes,

1998; Jain et al., 1999). Typical application areas for clustering include artificial intelligence, biology,

data mining, information retrieval, image processing, marketing, pattern recognition, statistics, and

Web mining (Berkhin, 2002; Everitt et al., 2001; Jain et al., 1999). Compared to classification methods,

cluster analysis has the advantage that it does not require any training data (i.e., the labeled data), but

can achieve the same goal in that it can arrange similar Web pages into groups.

The major aspects of the clustering problem for organizing Web pages are: To find the number of clusters

k in a Web page dataset, and to assign Web pages accurately to their clusters. Much work (Agrawal et al.,

1998; Dhillon et al., 2001; Ester et al., 1996; Guha et al., 1998a; Guha et al., 1998b; Hinneburg & Keim,

1999; Karypis & Kumar, 1999; Ng & Han, 1994; Rajaraman & Pan, 2000; Sander et al., 1998; Tantrum

et al., 2002; Yao & Karypis, 2001; Zhang et al., 1996; Zhao & Karypis, 1999) has been done to improve

the accuracy of assigning data to clusters in different domains, whereas no satisfactory method has been

found to estimate k in a dataset (Dudoit & Fridlyand, 2002; Strehl, 2002) though many methods were

proposed (Davies & Bouldin, 1979; Dudoit & Fridland, 2002; Milligan & Cooper, 1985). As a matter of

fact, finding k in a dataset is still a challenge in cluster analysis (Strehl, 2002). Almost all existing work

in this area assumes that k is known for clustering a dataset (e.g., Karypis et al., 1999; Zhao & Karypis,

1999). However in many applications, this is not true because there is little prior knowledge available

for cluster analysis except the feature space or the similarity space of a dataset.

This chapter addresses the problem of estimating k for Web page datasets. By testing many existing

methods for estimating k for datasets, we find that only the average inter-cluster similarity ( avgInter )

need to be used as the criterion to discover k for a Web page dataset. Our experiments show that when

the avgInter for a Web page dataset reaches a constant threshold, the clustering solutions for different

datasets from the Yahoo directory are measured to be the best. Compared to other criterion, e.g., the

maximal or minimal inter-cluster similarity among clusters, avgInter implies a characteristic for Web

page datasets.

This chapter also describes our new clustering algorithm called bi-directional hierarchical cluster-

ing. The new clustering algorithm arranges individual Web pages into clusters and then arranges the

clusters into larger clusters and so on until the average inter-cluster similarity approaches a constant

threshold. It produces a hierarchy of categories (see for example Figure 1), in which larger and more

general categories locate at the top while smaller and more specific categories locate at the bottom of the

hierarchy. Figure 1 shows the result of one of our experiments for clustering 766 Web pages to produce

a hierarchy of categories. The top (left-most) category contains all the Web pages (Dataset 1). The next

level consists of two categories, one of which has 94 Web pages and the other has 672 Web pages. Then,

each of the two categories has sub-categories and so on, as shown in the figure. This example shows that

our new clustering algorithm is able to handle categories of widely different sizes (such as 94 comparing

to 672 pages). By using two measures, purity and entropy, this example also shows that more general

categories (which have lower purity but higher entropy) locate at the top, while more specific categories

(which have higher purity but lower entropy) locate at the bottom of the hierarchy.

The rest of this chapter is organized as follows. The second section gives background and an over-

view of related methods. Our new bi-directional hierarchical clustering algorithm is presented in the

third section. The fourth section describes the Web page datasets used in our experiments. The fifth

section provides the experimental details for the discovery of a constant factor that characterizes the

Web domain. The sixth section shows how the constant factor is used for automatically discovering the

number of clusters. The seventh section provides the conclusion and future research.

Distributed Artificial Intelligence, Agent Technology, and Collaborative Applications

Search WWH ::

Custom Search

Home