Hierarchical Clustering
Another relatively simple method of cluster analysis is hierarchical
clustering, which generates a taxonomy, or hierarchy, of clusters. It has two
alternative approaches: bottom-up and top-down.
In bottom-up (agglomerative) hierarchical clustering, each observation is
initially assigned to its own cluster. The two closest clusters are then
repeatedly merged until only one cluster remains or until every cluster
reaches a predetermined minimum measure of cluster cohesiveness.
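As a minimal sketch of the bottom-up process, assuming plain Python and a hypothetical function name `agglomerative` (the single-linkage distance between clusters used here is one choice among several), the repeated merging might look like:

```python
from itertools import combinations

def agglomerative(points, target_k):
    # Bottom-up: each observation starts in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the closest pair of clusters; here "closest" is
        # single linkage: the minimum squared distance between members.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                    for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters into one.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Here merging stops at a target number of clusters rather than at a cohesiveness threshold; substituting a test on a dispersion measure such as SSE (defined later in this chapter) for `len(clusters) > target_k` gives the variant described above.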
In top-down (divisive) hierarchical clustering, we begin with one cluster
containing all observations. Clusters are then repeatedly divided until every
cluster reaches a predefined maximum measure of cluster cohesiveness.
Top-down hierarchical clustering introduces additional complexity: there must
be a way to select the next cluster to split and, once a cluster is selected,
a way to allocate its observations to the two newly created clusters. The
process of top-down clustering is similar to the process of tree building
presented in Chapter 4. In decision trees, the degree of homogeneity of a
node is based on the single classification variable, and a node is split
according to criteria that result in the most homogeneous child nodes. In
cluster analysis there is no classification variable. Hence, all attributes
(dimensions) must be used to compute a measure of cluster dispersion when
selecting which cluster to split. This will be discussed later when we
present measures of individual cluster and overall clustering quality.
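One common way to handle the two sub-problems, sketched below under assumed names (`split_step`, `_sse`), is to select the cluster with the largest SSE (a dispersion measure defined later in this section), seed the two new clusters with the farthest-apart pair of members, and allocate each observation to the nearer seed:

```python
from itertools import combinations

def _centroid(cluster):
    # Mean of the cluster's members along each dimension.
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]

def _sse(cluster):
    # Sum of squared distances from each member to the centroid.
    c = _centroid(cluster)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c)))
               for p in cluster)

def split_step(clusters):
    # Select the most dispersed cluster (largest SSE) to split next.
    i = max(range(len(clusters)), key=lambda k: _sse(clusters[k]))
    cluster = clusters.pop(i)
    # Seed the two new clusters with the farthest-apart pair of members.
    a, b = max(combinations(cluster, 2),
               key=lambda pq: sum((x - y) ** 2 for x, y in zip(*pq)))
    # Allocate every observation to the nearer seed.
    left, right = [], []
    for p in cluster:
        da = sum((x - y) ** 2 for x, y in zip(p, a))
        db = sum((x - y) ** 2 for x, y in zip(p, b))
        (left if da <= db else right).append(p)
    clusters.extend([left, right])
    return clusters
```

Repeating `split_step` until every cluster's dispersion falls below the chosen threshold yields the full top-down procedure; the farthest-pair seeding is one simple allocation rule, and a 2-means split is another common choice.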
Over the years numerous other methodologies have been proposed for cluster
analysis. Some have enhanced the existing K-means and hierarchical
proximity-based methodologies, while others have focused on density- or
connectedness-based methodologies. Self-organizing maps, which will be
introduced later in this chapter, can be thought of as an enhanced K-means
algorithm. For more information on these algorithms, the reader is directed
to texts dedicated primarily to cluster analysis.
Measures of Cluster and Clustering Quality
Given that in cluster analysis we never know whether we have found “the
correct answer”, measures are needed to evaluate the quality of a clustering.
In general terms, a clustering based on proximity is valid if its clusters
are individually cohesive (tightly packed around a centroid) and distinctly
separated from the other clusters in the clustering.
A measure of cluster cohesiveness is the sum of squared errors (SSE). SSE
is defined as the sum of the squared distances of each observation from the
cluster's centroid. More formally, for a cluster C_i with centroid c_i, it is:

SSE(C_i) = \sum_{x \in C_i} dist(x, c_i)^2
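As a concrete illustration, SSE for a single cluster can be computed as below; the helper names `centroid` and `sse` are assumptions for the sketch, not part of the text:

```python
def centroid(observations):
    # Mean of the observations along each dimension.
    dims = len(observations[0])
    n = len(observations)
    return tuple(sum(obs[d] for obs in observations) / n for d in range(dims))

def sse(observations, center):
    # Sum of squared Euclidean distances from each observation
    # to the cluster's centroid.
    return sum(sum((x - c) ** 2 for x, c in zip(obs, center))
               for obs in observations)
```

For example, the cluster {(0, 0), (2, 0), (1, 3)} has centroid (1, 1), and its SSE is 2 + 2 + 4 = 8; a smaller SSE indicates a more cohesive cluster.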