Database Reference
In-Depth Information
al. 1996) formalizes this idea by two parameters:
MinPts specifying a number of objects and ε
specifying a volume. An object is called core
object if it has at least MinPts objects within
its ε− neighbourhood. If one object P is in the
ε-neighbourhood of a core-object Q , then P is
said to be directly density reachable from Q . The
density-connectivity is the symmetric, transitive
closure of the direct density reachability , and a
density-based ( ε, MinPts ) -cluster is defined as a
maximal set of density-connected objects. It can
be proven that an ( ε, MinPts ) -cluster can be de-
tected by collecting all density reachable objects
starting from an arbitrary core object which is
implemented in DBSCAN. See Figure 2 for an
illustration of the definitions of DBSCAN. As K-
means, DBSCAN determines a non-hierarchical,
disjoint partitioning of the data set into clusters.
However, the number of clusters does not need be
specified in advance and the algorithm is robust
against noise objects. Nevertheless, the number
of clusters detected by DBSCAN depends on the
choice of the parameters ε and MinPts . For some
data sets even no suitable parameterization exists,
for example in the case of various object densities
in different areas of the data space or in the case
of a hierarchical cluster structure. To cope with
these problems, the algorithm OPTICS (Order-
ing Points to Identify the Clustering Structure)
(Ankerst et al. 1999) has been proposed which
is a hierarchical extension of DBSCAN but also
related to Single Link. The main idea of OPTICS
is to compute all possible clusterings for varying
ε simultaneously during a single traversal of the
data set. The output of OPTICS is a linear order
of the data objects according to their hierarchical
cluster structure which is visualized in the so-called
reachability-plot. OPTICS is equivalent to Single-
Link if the MinPts -Parameter of OPTICS is set
to one. OPTICS avoids the Single Link effect if
MinPts is set to larger values.
We introduced K-means and Single Link as
representatives of two different clustering para-
digms. It is important to point out that clustering
algorithms differ in their cluster notion. K-means
provides a well defined objective function which
intuitively coincides with our idea of clustering.
But this specific objective function also narrows
the type of clusters which can be detected: Only
spherically shaped Gaussian clusters are captured
by this definition. Single Link has a more general
cluster notion: Arbitrarily shaped dense areas
of the feature space are regarded as clusters. As
discussed, the algorithms further differ in the need
for parameter settings and the type of the result.
When introducing recent clustering paradigms,
we will always re-visit these points since they
are essential for the choice of a suitable clustering
algorithm for a certain application.
Figure 3 (a) displays a simple two-dimensional
data set consisting of three Gaussian clusters. For
comparison, Figure 3(b) displays the result of the
iterative partitioning EM algorithm and Figure 3(c)
the result of the hierarchical algorithm Single Link
on this data set. Correctly parameterized with K=3
the result of EM clustering is a Gaussian mixture
model with a good fit to the data. This result in-
cludes location and variance of each cluster which
is often important for interpretation. The result of
Single Link is a hierarchical visualization of the
cluster structure. The three clusters are clearly vis-
ible in the dendrogram. However, the dendrogram
always implies a hierarchical structure, even if
there is no distinct cluster hierarchy in the data as
in this example. In contrast to the result of EM,
the dendrogram does not provide a model on the
data. But note that Single Link does not require
any input parameters whereas the result of EM
strongly depends on a suitable parameterization.
Figure 2. Definitions of DBSCAN
Search WWH ::




Custom Search