Information Technology Reference
In-Depth Information
similar to one another than data in other clusters (Berkhin, 2002; Everitt et al., 2001; Jain & Dubes,
1998; Jain et al., 1999). Typical application areas for clustering include artificial intelligence, biology,
data mining, information retrieval, image processing, marketing, pattern recognition, statistics, and
Web mining (Berkhin, 2002; Everitt et al., 2001; Jain et al., 1999). Compared to classification methods,
cluster analysis has the advantage that it does not require any training data (i.e., the labeled data), but
can achieve the same goal in that it can arrange similar Web pages into groups.
The major aspects of the clustering problem for organizing Web pages are: To find the number of clusters
k in a Web page dataset, and to assign Web pages accurately to their clusters. Much work (Agrawal et al.,
1998; Dhillon et al., 2001; Ester et al., 1996; Guha et al., 1998a; Guha et al., 1998b; Hinneburg & Keim,
1999; Karypis & Kumar, 1999; Ng & Han, 1994; Rajaraman & Pan, 2000; Sander et al., 1998; Tantrum
et al., 2002; Yao & Karypis, 2001; Zhang et al., 1996; Zhao & Karypis, 1999) has been done to improve
the accuracy of assigning data to clusters in different domains, whereas no satisfactory method has been
found to estimate k in a dataset (Dudoit & Fridlyand, 2002; Strehl, 2002) though many methods were
proposed (Davies & Bouldin, 1979; Dudoit & Fridland, 2002; Milligan & Cooper, 1985). As a matter of
fact, finding k in a dataset is still a challenge in cluster analysis (Strehl, 2002). Almost all existing work
in this area assumes that k is known for clustering a dataset (e.g., Karypis et al., 1999; Zhao & Karypis,
1999). However in many applications, this is not true because there is little prior knowledge available
for cluster analysis except the feature space or the similarity space of a dataset.
This chapter addresses the problem of estimating k for Web page datasets. By testing many existing
methods for estimating k for datasets, we find that only the average inter-cluster similarity ( avgInter )
need to be used as the criterion to discover k for a Web page dataset. Our experiments show that when
the avgInter for a Web page dataset reaches a constant threshold, the clustering solutions for different
datasets from the Yahoo directory are measured to be the best. Compared to other criterion, e.g., the
maximal or minimal inter-cluster similarity among clusters, avgInter implies a characteristic for Web
page datasets.
This chapter also describes our new clustering algorithm called bi-directional hierarchical cluster-
ing. The new clustering algorithm arranges individual Web pages into clusters and then arranges the
clusters into larger clusters and so on until the average inter-cluster similarity approaches a constant
threshold. It produces a hierarchy of categories (see for example Figure 1), in which larger and more
general categories locate at the top while smaller and more specific categories locate at the bottom of the
hierarchy. Figure 1 shows the result of one of our experiments for clustering 766 Web pages to produce
a hierarchy of categories. The top (left-most) category contains all the Web pages (Dataset 1). The next
level consists of two categories, one of which has 94 Web pages and the other has 672 Web pages. Then,
each of the two categories has sub-categories and so on, as shown in the figure. This example shows that
our new clustering algorithm is able to handle categories of widely different sizes (such as 94 comparing
to 672 pages). By using two measures, purity and entropy, this example also shows that more general
categories (which have lower purity but higher entropy) locate at the top, while more specific categories
(which have higher purity but lower entropy) locate at the bottom of the hierarchy.
The rest of this chapter is organized as follows. The second section gives background and an over-
view of related methods. Our new bi-directional hierarchical clustering algorithm is presented in the
third section. The fourth section describes the Web page datasets used in our experiments. The fifth
section provides the experimental details for the discovery of a constant factor that characterizes the
Web domain. The sixth section shows how the constant factor is used for automatically discovering the
number of clusters. The seventh section provides the conclusion and future research.
Search WWH ::




Custom Search