assigned objects. (2) Re-assign objects: Assign
each object to its closest center. The algorithm
converges as soon as no object changes its cluster
assignment between two subsequent iterations. In
most cases, fast convergence can be observed. The
optimization function of K-means is well defined.
The algorithm minimizes the sum of squared
distances of the objects to their cluster centers.
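In symbols (the notation K, C_j, and mu_j is introduced here for illustration, not taken from the original), this objective can be written as

\[ \mathrm{SSE} = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2}, \]

where C_1, ..., C_K are the clusters and mu_j is the center (mean) of cluster C_j.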
This optimization goal coincides well with our
definition of the clustering problem provided in the
beginning: Objects assigned to a common cluster
should be as similar as possible. The second aspect of the definition, that objects in different clusters should differ as much as possible, is implicitly addressed at the same time. However, finding a global minimum of the objective function is an NP-hard problem (see, for example, Meila 2008). The objective function is non-linear and non-convex, so no efficient algorithm is known that finds the global minimum exactly.
K-means converges to a local minimum of the
objective function in a very acceptable time frame.
In many cases the result is close to optimal, and K-means is thus among the most widespread clustering algorithms. In practice, it is useful to try several different random initializations and keep the best result.
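As an illustration, the following is a minimal sketch of this iteration in Python (the function and variable names are ours, not from the original; it assumes numeric data in a NumPy array):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd-style) sketch: alternate center
    updates and re-assignments until assignments are stable."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct objects as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Re-assign objects: each object to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no object changed its cluster
        labels = new_labels
        # Update centers: mean of the currently assigned objects.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Objective value: sum of squared distances to cluster centers.
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse

Running this with several seeds and keeping the result with the smallest sse value implements the restart heuristic just mentioned.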
There are many algorithms following the K-means paradigm, perhaps most importantly the expectation maximization (EM) algorithm (Dempster et al. 1977) for fitting Gaussian mixture models with fuzzy cluster assignments.
Like K-means, the EM algorithm consists of two steps which are iterated until convergence: (1) update the centers to maximize the log-likelihood of the data (this step is often called the M-step, where M stands for maximization), and (2) assign each object to all centers in proportion to its likelihood (this step is called the E-step, where E stands for expectation, since the expected value of the log-likelihood is computed).
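A compact sketch of these two steps, for the simplified case of a Gaussian mixture with fixed spherical covariance (this simplification and all names are ours), might look as follows:

import numpy as np

def em_gmm(X, k, n_iter=50, var=1.0, seed=0):
    """EM sketch for a Gaussian mixture with fixed spherical
    covariance, showing the E-step / M-step alternation."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each center for each object,
        # proportional to weight * likelihood (fuzzy assignment).
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        resp = weights * np.exp(-0.5 * sq / var)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and weights to maximize the
        # expected log-likelihood under these responsibilities.
        nk = resp.sum(axis=0) + 1e-12  # guard against empty components
        means = (resp.T @ X) / nk[:, None]
        weights = nk / len(X)
    return means, weights, resp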
Another important branch of iterative partitioning clustering consists of K-medoid methods such as PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw 1990) or CLARANS (Clustering Large Applications based on Randomized Search) (Ng & Han 1994). Instead of
the mean, these methods select objects from the
data set as cluster representatives. Therefore, non-vector metric data can also be clustered, since only a distance function between objects is required. However, selecting a suitable K is a major problem with all of these algorithms.
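To make concrete why medoids allow clustering non-vector data, here is a small sketch that operates only on a precomputed distance matrix (a simplified alternating scheme, not full PAM or CLARANS; all names are ours):

import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """K-medoid sketch on a symmetric n x n distance matrix D.
    Only pairwise distances are used, so any metric data works."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        # Re-assign each object to its closest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # In each cluster, pick as new medoid the member with the
        # smallest sum of distances to all other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # medoids stable: converged
        medoids = new_medoids
    return medoids, labels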
To avoid the problems with parameterization
and local minima, hierarchical or density-based
clustering can be an attractive alternative. Single
Link (Jain & Dubes 1988) is the basic algorithm
for hierarchical clustering. As its result, the algorithm produces a tree-style visualization of the hierarchical cluster structure, which is often called a dendrogram. At the lowest level of the hierarchy
all objects are represented as singleton clusters. In
each step, the closest pair of clusters is merged into a cluster at the next higher level. Besides the
distance function, no parameterization is required.
There is no objective function to be minimized, and the result is deterministic. The runtime of Single
Link is quadratic in the number of objects, which
is acceptable in most applications. However, the
result is often hard to interpret, especially for large
data sets, since the visualization is the only output
of the algorithm. If a partitioning into distinct
clusters is desired, it is difficult to find a suitable
level for horizontally cutting the hierarchy. In the
presence of noise objects, the so-called Single
Link effect can occur: Clusters may get connected
by a chain of noise objects. Popular variants of
Single Link which are somewhat less prone to the
Single Link effect are Average Link and Complete
Link. These algorithms introduce distance func-
tions between sets of objects for clustering. In
Average Link the average distance between two
sets of objects is applied; in Complete Link, the maximum distance.
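The following sketch builds such a hierarchy with SciPy (which provides these linkage methods) and cuts it horizontally to obtain a flat partitioning; the toy data and the cut threshold are made up for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data
# 'single' = minimum, 'average' = mean, 'complete' = maximum
# distance between two sets of objects.
Z = linkage(X, method='single')
# Horizontal cut of the hierarchy at distance 0.5: merges above
# this level are discarded, yielding a flat partitioning.
labels = fcluster(Z, t=0.5, criterion='distance')
# dendrogram(Z) draws the tree-style visualization.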
Strongly related to the cluster notion of Single
Link is the idea of density-based clustering. In
density-based clustering, clusters are regarded as areas of high object density, which are separated
by areas of lower object density. The algorithm
DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) was proposed in (Ester et al. 1996).
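Its core idea can be sketched as follows (eps and min_pts are DBSCAN's usual density parameters; this simplified implementation is ours, not the published algorithm):

import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Simplified DBSCAN-style clustering: dense (core) objects
    are connected into clusters; sparse objects stay noise (-1)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)  # -1 marks noise / unvisited objects
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster from this core object, expanding
        # through the neighborhoods of all reachable core objects.
        labels[i] = cluster
        stack = [i]
        while stack:
            for q in neighbors[stack.pop()]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:
                        stack.append(q)
        cluster += 1
    return labels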