assigned objects. (2) Re-assign objects: Assign
each object to its closest center. The algorithm
converges as soon as no object changes its cluster
assignment between two subsequent iterations. In
most cases, fast convergence can be observed. The
optimization function of K-means is well defined.
The algorithm minimizes the sum of squared
distances of the objects to their cluster centers.
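In symbols (the notation K, C_j, and mu_j is introduced here for illustration, not taken from the original), this objective can be written as

\[ \mathrm{SSE} = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2}, \]

where C_1, ..., C_K are the clusters and mu_j is the center (mean) of cluster C_j.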
This optimization goal coincides well with our
definition of the clustering problem provided in the
beginning: Objects assigned to a common cluster
should be as similar as possible. The second aspect of the definition, that objects in different clusters should differ as much as possible, is implicitly addressed at the same time. However, finding a global minimum of the objective function is an NP-hard problem (see, for example, Meila 2008). The objective function is non-linear and non-convex, so no efficient algorithm is known that finds the global minimum exactly.
K-means converges to a local minimum of the
objective function in a very acceptable time frame.
In many cases the result is close to optimal, and K-means is thus among the most widespread clustering algorithms. In practice, it is useful to try several different random initializations and keep the best result.
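As an illustration, the following is a minimal sketch of this iteration in Python (the function and variable names are ours, not from the original; it assumes numeric data in a NumPy array):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd-style) sketch: alternate center
    updates and re-assignments until assignments are stable."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct objects as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # Re-assign objects: each object to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no object changed its cluster
        labels = new_labels
        # Update centers: mean of the currently assigned objects.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Objective value: sum of squared distances to cluster centers.
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse

Running this with several seeds and keeping the result with the smallest sse value implements the restart heuristic just mentioned.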
There are many algorithms following the K-means paradigm, perhaps most importantly the expectation maximization (EM) algorithm (Dempster et al. 1977) for fitting Gaussian mixture models with fuzzy cluster assignments.
Like K-means, the EM algorithm consists of two steps which are iterated until convergence: (1) update the centers to maximize the log-likelihood of the data (this step is often called the M-step, where M stands for maximization), and (2) assign each object to all centers in proportion to its likelihood (this step is called the E-step, where E stands for expectation, since the expected value of the log-likelihood is computed).
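A compact sketch of these two steps, for the simplified case of a Gaussian mixture with fixed spherical covariance (this simplification and all names are ours), might look as follows:

import numpy as np

def em_gmm(X, k, n_iter=50, var=1.0, seed=0):
    """EM sketch for a Gaussian mixture with fixed spherical
    covariance, showing the E-step / M-step alternation."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each center for each object,
        # proportional to weight * likelihood (fuzzy assignment).
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        resp = weights * np.exp(-0.5 * sq / var)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means and weights to maximize the
        # expected log-likelihood under these responsibilities.
        nk = resp.sum(axis=0) + 1e-12  # guard against empty components
        means = (resp.T @ X) / nk[:, None]
        weights = nk / len(X)
    return means, weights, resp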
Another important branch of iterative partitioning clustering consists of K-medoid methods such as PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw 1990) or CLARANS (Clustering Large Applications based on Randomized Search) (Ng & Han 1994). Instead of
the mean, these methods select objects from the
data set as cluster representatives. Therefore, non-vector metric data can also be clustered, since only a distance function between objects is required. However, selecting a suitable K is a major problem with all of these algorithms.
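To make concrete why medoids allow clustering non-vector data, here is a small sketch that operates only on a precomputed distance matrix (a simplified alternating scheme, not full PAM or CLARANS; all names are ours):

import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """K-medoid sketch on a symmetric n x n distance matrix D.
    Only pairwise distances are used, so any metric data works."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        # Re-assign each object to its closest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # In each cluster, pick as new medoid the member with the
        # smallest sum of distances to all other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # medoids stable: converged
        medoids = new_medoids
    return medoids, labels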
To avoid the problems with parameterization
and local minima, hierarchical or density-based
clustering can be an attractive alternative. Single
Link (Jain & Dubes 1988) is the basic algorithm
for hierarchical clustering. As its result, the algorithm produces a tree-style visualization of the hierarchical cluster structure, which is often called a dendrogram. At the lowest level of the hierarchy
all objects are represented as singleton clusters. In
each step, the closest pair of clusters is merged into a cluster at the next higher level. Besides the
distance function, no parameterization is required.
There is no objective function to be minimized, and the result is deterministic. The runtime of Single
Link is quadratic in the number of objects, which
is acceptable in most applications. However, the
result is often hard to interpret, especially for large
data sets, since the visualization is the only output
of the algorithm. If a partitioning into distinct
clusters is desired, it is difficult to find a suitable
level for horizontally cutting the hierarchy. In the
presence of noise objects, the so-called Single
Link effect can occur: Clusters may get connected
by a chain of noise objects. Popular variants of
Single Link which are somewhat less prone to the
Single Link effect are Average Link and Complete
Link. These algorithms introduce distance func-
tions between sets of objects for clustering. In
Average Link the average distance between two
sets of objects is applied; in Complete Link, the maximum distance.
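The following sketch builds such a hierarchy with SciPy (which provides these linkage methods) and cuts it horizontally to obtain a flat partitioning; the toy data and the cut threshold are made up for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data
# 'single' = minimum, 'average' = mean, 'complete' = maximum
# distance between two sets of objects.
Z = linkage(X, method='single')
# Horizontal cut of the hierarchy at distance 0.5: merges above
# this level are discarded, yielding a flat partitioning.
labels = fcluster(Z, t=0.5, criterion='distance')
# dendrogram(Z) draws the tree-style visualization.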
Strongly related to the cluster notion of Single
Link is the idea of density-based clustering. In
density-based clustering, clusters are regarded as areas of high object density, which are separated
by areas of lower object density. The algorithm
DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) was proposed in (Ester et al. 1996).
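Its core idea can be sketched as follows (eps and min_pts are DBSCAN's usual density parameters; this simplified implementation is ours, not the published algorithm):

import numpy as np

def dbscan_sketch(X, eps, min_pts):
    """Simplified DBSCAN-style clustering: dense (core) objects
    are connected into clusters; sparse objects stay noise (-1)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)  # -1 marks noise / unvisited objects
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster from this core object, expanding
        # through the neighborhoods of all reachable core objects.
        labels[i] = cluster
        stack = [i]
        while stack:
            for q in neighbors[stack.pop()]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:
                        stack.append(q)
        cluster += 1
    return labels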