Hierarchical Clustering
Another relatively simple method of cluster analysis is hierarchical
clustering, which generates a taxonomy, or hierarchy, of clusters. It has two
alternative approaches: bottom-up and top-down.
In bottom-up (agglomerative) hierarchical clustering, each observation is
initially assigned to its own cluster. The two closest clusters are then
repeatedly merged until only one cluster remains or until every cluster
reaches a predetermined minimum measure of cluster cohesiveness.
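As a minimal sketch of the bottom-up process, assuming plain Python and a hypothetical function name `agglomerative` (the single-linkage distance between clusters used here is one choice among several), the repeated merging might look like:

```python
from itertools import combinations

def agglomerative(points, target_k):
    # Bottom-up: each observation starts in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the closest pair of clusters; here "closest" is
        # single linkage: the minimum squared distance between members.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                    for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters into one.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Here merging stops at a target number of clusters rather than at a cohesiveness threshold; substituting a test on a dispersion measure such as SSE (defined later in this chapter) for `len(clusters) > target_k` gives the variant described above.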
In top-down (divisive) hierarchical clustering, we begin with one cluster
containing all observations. Clusters are then repeatedly divided until every
cluster reaches a predefined maximum measure of cluster cohesiveness.
Top-down hierarchical clustering introduces additional complexity: there must
be a way to select the next cluster to split and, once a cluster is selected,
a way to allocate its observations to the two newly created clusters. The
process of top-down clustering is similar to the process of tree building
presented in Chapter 4. In decision trees, the degree of homogeneity of a
node is based on the single classification variable, and a node is split
according to criteria that result in the most homogeneous child nodes. In
cluster analysis there is no classification variable. Hence, all attributes
(dimensions) must be used to compute a measure of cluster dispersion when
selecting which cluster to split. This will be discussed later when we
present measures of individual cluster and overall clustering quality.
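One common way to handle the two sub-problems, sketched below under assumed names (`split_step`, `_sse`), is to select the cluster with the largest SSE (a dispersion measure defined later in this section), seed the two new clusters with the farthest-apart pair of members, and allocate each observation to the nearer seed:

```python
from itertools import combinations

def _centroid(cluster):
    # Mean of the cluster's members along each dimension.
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]

def _sse(cluster):
    # Sum of squared distances from each member to the centroid.
    c = _centroid(cluster)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c)))
               for p in cluster)

def split_step(clusters):
    # Select the most dispersed cluster (largest SSE) to split next.
    i = max(range(len(clusters)), key=lambda k: _sse(clusters[k]))
    cluster = clusters.pop(i)
    # Seed the two new clusters with the farthest-apart pair of members.
    a, b = max(combinations(cluster, 2),
               key=lambda pq: sum((x - y) ** 2 for x, y in zip(*pq)))
    # Allocate every observation to the nearer seed.
    left, right = [], []
    for p in cluster:
        da = sum((x - y) ** 2 for x, y in zip(p, a))
        db = sum((x - y) ** 2 for x, y in zip(p, b))
        (left if da <= db else right).append(p)
    clusters.extend([left, right])
    return clusters
```

Repeating `split_step` until every cluster's dispersion falls below the chosen threshold yields the full top-down procedure; the farthest-pair seeding is one simple allocation rule, and a 2-means split is another common choice.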
Over the years numerous other methodologies have been proposed for cluster
analysis. Some have enhanced the existing K-means and hierarchical
proximity-based methodologies, while others have focused on density- or
connectedness-based methodologies. Self-organizing maps, which will be
introduced later in this chapter, can be thought of as an enhanced K-means
algorithm. For more information on these algorithms, the reader is directed
to texts dedicated primarily to cluster analysis.
Measures of Cluster and Clustering Quality
Given that in cluster analysis we never know whether we have found “the
correct answer”, measures are needed to evaluate the quality of a clustering.
In general terms, a clustering based on proximity is valid if its clusters
are individually cohesive (tightly packed around a centroid) and distinctly
separated from the other clusters in the clustering.
A measure of cluster cohesiveness is the sum of squared errors (SSE). SSE
is defined as the sum of the squared distances of each observation from the
cluster's centroid. More formally, for a cluster C_i with centroid c_i, it is:

SSE(C_i) = \sum_{x \in C_i} dist(x, c_i)^2
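As a concrete illustration, SSE for a single cluster can be computed as below; the helper names `centroid` and `sse` are assumptions for the sketch, not part of the text:

```python
def centroid(observations):
    # Mean of the observations along each dimension.
    dims = len(observations[0])
    n = len(observations)
    return tuple(sum(obs[d] for obs in observations) / n for d in range(dims))

def sse(observations, center):
    # Sum of squared Euclidean distances from each observation
    # to the cluster's centroid.
    return sum(sum((x - c) ** 2 for x, c in zip(obs, center))
               for obs in observations)
```

For example, the cluster {(0, 0), (2, 0), (1, 3)} has centroid (1, 1), and its SSE is 2 + 2 + 4 = 8; a smaller SSE indicates a more cohesive cluster.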