Novel Trends in Clustering - Evolving Application Domains of Data Warehousing and Mining

Database Reference

In-Depth Information

Figure 4. Subspace and Projected Clustering. (a) The curse of dimensionality: In one-dimensional space

there are two distinct clusters C1 and C2. In two-dimensional space the cluster structure is blurred due

to one noise dimension; (b) and (c): Objects can be differently clustered in different subspaces. For

example, in the subspace spanned by the features A and B, object 1 is clustered together with the objects

2, 3, and 4. In the subspace spanned by C and D, object 1 clustered together with the objects 2, 5, and

6. This information is preserved in subspace clustering but not in projected clustering

transform the data to a lower dimensional space.

However, at least for clustering, PCA is not the

best choice to cure the curse of dimensionality.

PCA preserves the overall variance in the data.

This implies, if the data has high variance without

cluster structure in the full dimensional space, the

clustering can not be much better in a subspace

selected by PCA.

Subspace clustering is a better solution. Many

high dimensional data sets exhibit rich cluster

structures in axis-parallel subspaces which are

spanned by subsets of the features. In our me-

tabolite scenario, clusters of patients can only

be identified in a subspace spanned by a subset

of the metabolites. In addition, patients can be

clustered differently in different subspaces. See

Figure 3 (b) and (c) for an illustration of this ef-

fect. The subspace clustering problem is highly

complex since a vector space of dimensional-

ity d has 2 d -1 axis-parallel subspaces. Thus, an

exhaustive search for the best subspaces is

infeasible in higher dimensions. The algorithm

CLIQUE (Clustering in Quest) (Agrawal et al.

1998), the fundamental approach to subspace

clustering, introduces a monotonic criterion for

object density which allows effective pruning in

combination with bottom-up search. To define the

density criterion, the data space is partitioned by

an axis-parallel grid into equal-sized units of width

ε. Only units whose densities exceed a threshold

τ are retained. A cluster is defined as a maximal

set of connected dense units. This cluster notion

allows effective pruning of the search space

using the upwards monotonicity of the density

criterion: Only subspaces containing dense units

may be part of a higher dimensional subspace

cluster. Successive modifications of CLIQUE with

similar algorithmic paradigm include ENCLUS

(Entropy-based Subspace Clustering) (Cheng et

al. 1998) and MAFIA (Merging Adaptive Finite

Intervals) (Nagesh et al. 2000). A drawback of

these methods is the use of grids. In general, the

efficiency and the accuracy of these approaches

heavily depend on the positioning and resolu-

tion of the grids. Objects that naturally belong

to a cluster may be missed or objects that are

naturally noise may be assigned to a cluster due

to an unfavorable grid position. The algorithm

SUBCLU (Density-connected Subspace Cluster-

ing) (Kailing et al. 2004) avoids this problem by

defining a monotonic density criterion relying

on concepts of density-based clustering without

grids. More precisely, the core object property

of DBSCAN is applied in the definition of the

density criterion. As a grid cell, the core object

property is upwards monotonic.

Evolving Application Domains of Data Warehousing and Mining

Search WWH ::

Custom Search

Home