Information Technology Reference
In-Depth Information
2
k
.
(10)
T
( ) (1
k
= −
) ( )
k
n
The number of clusters in a dataset is estimated to be
argmax T
k .
It can be noticed that the above methods cannot be used for estimating k =1 for a dataset. Some other
methods, e.g., Clest (Dudoit & Fridlyand, 2002), Hartigan (1985), and gap (Tibshirani et al., 2000) were
also found in literature.
In summary, most existing methods make use of the distance (or similarity) of inter-cluster and (or)
intra-cluster of a dataset. The problem is that none of them is satisfactory for all kinds of cluster analysis
(Dudoit & Fridlyand, 2002; Stehl, 2002). One reason may be that people have different opinions about
the granularity of clusters and there may be several right answers to k with respect to different desired
granularity. Unlike partitional (flat) clustering algorithms, hierarchical clustering algorithms may have
different k 's by cutting the dendrogram at different levels, hence providing flexibility for clustering
results.
In the next section we will present our new clustering algorithm which is used to cluster Web pages
and to estimate k for Web page datasets. Throughout this chapter, we use term “documents” or “Web
pages” to denote Web pages, the term “true class” to mean a class of Web pages which contains Web
pages labeled with the same class label, and the term “cluster” to denote a group of Web pages in which
Web pages may have different class labels.
2
Bi-directional
hierarchical
clustering
algorith
M
We present our new bi-directional hierarchical clustering (BHC) system (Yao & Choi, 2003, 2007) in
this section. The BHC system consists of three major steps:
1. Generating an initial sparse graph
2. Bottom-up cluster-merging phase
3. Top-down refining phase
These major steps are described in detail in the following subsections. Here we outline the workings of
the entire system. In the first phase, the BHC system takes a given dataset and generates an initial sparse
graph (e.g., Figure 2), where a node represents a cluster, and is connected to its k-nearest neighbors by
similarity-weighted edges. The BHC system then creates a hierarchical structure of clusters in the two
phases, the bottom-up cluster-merging phase and the top-down refinement phase. During the bottom-up
cluster-merging phase, two or more nodes (clusters) are merged together to form a larger cluster. Then,
again two or more clusters are merged and so on until a stopping condition is met. During the top-down
refinement phase, the BHC system eliminates the early errors that may occur in the greedy bottom-up
cluster-merging phase. It moves some nodes between clusters to minimize the inter-cluster similarities.
Thus, these two phases make items in a cluster more similar and make clusters more distinct from each
other. The key features of the BHC system are that it produces a hierarchical structure of clusters much
faster than the existing hierarchical clustering algorithms, and it improves clustering results using a
refinement process, as detailed in the following.
Search WWH ::




Custom Search