Database Reference
In-Depth Information
Figure 4. Subspace and Projected Clustering. (a) The curse of dimensionality: In one-dimensional space
there are two distinct clusters C1 and C2. In two-dimensional space the cluster structure is blurred due
to one noise dimension; (b) and (c): Objects can be differently clustered in different subspaces. For
example, in the subspace spanned by the features A and B, object 1 is clustered together with the objects
2, 3, and 4. In the subspace spanned by C and D, object 1 clustered together with the objects 2, 5, and
6. This information is preserved in subspace clustering but not in projected clustering
transform the data to a lower dimensional space.
However, at least for clustering, PCA is not the
best choice to cure the curse of dimensionality.
PCA preserves the overall variance in the data.
This implies, if the data has high variance without
cluster structure in the full dimensional space, the
clustering can not be much better in a subspace
selected by PCA.
Subspace clustering is a better solution. Many
high dimensional data sets exhibit rich cluster
structures in axis-parallel subspaces which are
spanned by subsets of the features. In our me-
tabolite scenario, clusters of patients can only
be identified in a subspace spanned by a subset
of the metabolites. In addition, patients can be
clustered differently in different subspaces. See
Figure 3 (b) and (c) for an illustration of this ef-
fect. The subspace clustering problem is highly
complex since a vector space of dimensional-
ity d has 2 d -1 axis-parallel subspaces. Thus, an
exhaustive search for the best subspaces is
infeasible in higher dimensions. The algorithm
CLIQUE (Clustering in Quest) (Agrawal et al.
1998), the fundamental approach to subspace
clustering, introduces a monotonic criterion for
object density which allows effective pruning in
combination with bottom-up search. To define the
density criterion, the data space is partitioned by
an axis-parallel grid into equal-sized units of width
ε. Only units whose densities exceed a threshold
τ are retained. A cluster is defined as a maximal
set of connected dense units. This cluster notion
allows effective pruning of the search space
using the upwards monotonicity of the density
criterion: Only subspaces containing dense units
may be part of a higher dimensional subspace
cluster. Successive modifications of CLIQUE with
similar algorithmic paradigm include ENCLUS
(Entropy-based Subspace Clustering) (Cheng et
al. 1998) and MAFIA (Merging Adaptive Finite
Intervals) (Nagesh et al. 2000). A drawback of
these methods is the use of grids. In general, the
efficiency and the accuracy of these approaches
heavily depend on the positioning and resolu-
tion of the grids. Objects that naturally belong
to a cluster may be missed or objects that are
naturally noise may be assigned to a cluster due
to an unfavorable grid position. The algorithm
SUBCLU (Density-connected Subspace Cluster-
ing) (Kailing et al. 2004) avoids this problem by
defining a monotonic density criterion relying
on concepts of density-based clustering without
grids. More precisely, the core object property
of DBSCAN is applied in the definition of the
density criterion. As a grid cell, the core object
property is upwards monotonic.
Search WWH ::




Custom Search