such as k-means and k-medoids, bottom-up hierarchical approaches such
as single-link or complete-link agglomerative clustering, soft partitioning ap-
proaches such as fuzzy clustering, EM-based techniques, and methods moti-
vated by statistical mechanics [3.32]. Although several methods of clustering
data defined by pairwise (dis)similarities are available [3.33], most classical
techniques, as well as recent techniques proposed in the data mining com-
munity (CLARANS, DBScan, BIRCH, CLIQUE, CURE, WaveCluster etc.
[3.34]), are based on distances between the samples in the original feature
space. The emphasis of the data mining-oriented proposals is primarily on
an e cient and scalable (with respect to number of records) implementation
of approximate k-means, k-medioids, or local density estimation. Thus they
are all faced with the “curse of dimensionality” [3.35] and the associated
sparsity issues when dealing with very high-dimensional data. Essentially,
the amount of data to sustain a given spatial density increases exponentially
with the dimensionality of the input space, or alternatively, the sparsity in-
creases exponentially given a constant amount of data, with points tending to
become equidistant from one another. In general, this will adversely impact
any method based on spatial density, unless the data follow certain simple
distributions as described in the introduction. Certain other limitations of
popular clustering methods are nicely illustrated in [3.9]. In [3.36], the au-
thors recognize that one way of tackling high-dimensional data is to change
the distance function in an application-specific way. They suggest some pos-
sible modified functions and principles but do not provide any experimental
results.
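The distance-concentration effect can be made concrete with a short illustrative sketch (Python with NumPy; the sample size and dimensionalities are arbitrary choices, not taken from the cited works). With the number of samples held fixed, the ratio of the largest to the smallest pairwise distance shrinks toward one as the dimensionality grows, i.e., the points become nearly equidistant:

```python
# Illustrative sketch: distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # fixed number of samples

for d in (2, 10, 100, 1000):
    x = rng.random((n, d))                       # uniform points in [0, 1]^d
    # squared pairwise distances via |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (x ** 2).sum(1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])  # unique pairs only
    print(f"d={d:5d}  max/min pairwise distance ratio = "
          f"{dist.max() / dist.min():.2f}")
```

As d grows the printed ratio approaches 1, which is precisely why density-based notions of "close" and "far" lose their discriminating power.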
In databases, where clustering is often tied to the need for efficient indexing,
a variety of space-partitioning methods (such as KDB-trees) and data-partitioning
methods (e.g., R-trees and variants) exist. These methods are typically
tractable for up to 10- to 15-dimensional data, and by a judicious hybrid of
these two approaches, data with tens of attributes may be partitioned [3.37].
Significant overlaps among the hyperrectangles and the occurrence of several
empty areas become increasingly problematic if the dimensionality is further
increased (see [3.37] for more details).
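The degradation can be illustrated with a hypothetical sketch (Python with NumPy; this mimics the effect rather than any actual R-tree or KDB-tree algorithm, and the parameters are arbitrary). Data are split into 2^k pages by median cuts on the first k dimensions, roughly what such an index can afford, and we count how many page hyperrectangles a nearest-neighbor search ball must inspect. In high dimensions the bounding boxes span nearly the full data extent along every unsplit axis, so almost every page is touched:

```python
# Hypothetical sketch: page hyperrectangles vs. a nearest-neighbor query.
import numpy as np

rng = np.random.default_rng(1)
n, k = 2048, 4                                  # 2**k = 16 pages

for d in (4, 16, 128):
    x = rng.random((n, d))
    # assign each point to a page via median splits on dimensions 0..k-1
    page = np.zeros(n, dtype=int)
    for j in range(k):
        page = 2 * page + (x[:, j] > np.median(x[:, j]))
    q = rng.random(d)                           # random query point
    r = np.sqrt(((x - q) ** 2).sum(1)).min()    # exact NN distance
    touched = 0
    for p in range(2 ** k):
        pts = x[page == p]
        lo, hi = pts.min(0), pts.max(0)         # the page's bounding box
        gap = np.maximum(np.maximum(lo - q, q - hi), 0.0)
        if (gap ** 2).sum() <= r ** 2:          # does the ball meet the box?
            touched += 1
    print(f"d={d:4d}  pages intersecting the NN search ball: "
          f"{touched}/{2 ** k}")
```

At low d only a handful of pages intersect the search ball; at high d virtually all of them do, so the index degenerates to a scan.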
Graph-theoretic clustering has been known for some time [3.14], though it is
not commonly applied. Lately, however, such approaches have proved attractive
for gene expression analysis [3.27]. Graph-based methods have also emerged in the
data mining literature to tackle high-dimensional data analysis. ROCK (Ro-
bust Clustering using linKs) [3.7] is an agglomerative hierarchical clustering
technique for categorical attributes. It uses the binary Jaccard coefficient
and a thresholding criterion to establish links between samples. Common
neighbors are used to define interconnectivity of clusters, which is used to
merge clusters. CHAMELEON [3.9] starts with partitioning the data into
a large number of clusters by partitioning the v-nearest neighbor graph. In
the subsequent stage clusters are merged based on relative interconnectivity
and relative closeness measures. These localized measures lead to a dynamic
adaptation of the merging process to the internal characteristics of the
clusters being combined.
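ROCK's neighbor and link computation, described above, can be sketched as follows (Python with NumPy; the threshold value and the toy binary data are illustrative assumptions, not values from [3.7]):

```python
# Illustrative sketch of ROCK-style links: samples are neighbors if their
# binary Jaccard coefficient passes a threshold; link(i, j) counts their
# common neighbors and drives cluster merging.
import numpy as np

def jaccard(a, b):
    """Binary Jaccard coefficient: |a AND b| / |a OR b|."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

x = np.array([[1, 1, 0, 0],                     # toy binary transactions
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=bool)
theta = 0.4                                     # illustrative threshold
n = len(x)

# adjacency: i and j are neighbors if their similarity passes the threshold
adj = np.array([[i != j and jaccard(x[i], x[j]) >= theta
                 for j in range(n)] for i in range(n)])

# link(i, j) = number of common neighbors of samples i and j
links = adj.astype(int) @ adj.astype(int)
print(links)
```

Here samples 0 and 2 are not direct neighbors yet share a common neighbor (sample 1), so they acquire a nonzero link count; it is such link counts, rather than raw pairwise similarity, that ROCK's merging criterion maximizes.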