Recently proposed algorithms like SURFING (Subspaces Relevant for Clustering) (Baumgartner et al. 2004), SCHISM (Support and Chernoff-Hoeffding Bound-based Interesting Subspace Miner) (Sequeira & Zaki 2004) and DUSC (Dimensionality-Unbiased Subspace Clustering) (Assent et al. 2007) refrain from a cluster notion based on a monotonic, fixed density criterion, since such a criterion leads to major problems. With increasing dimensionality, the object density naturally decreases. A fixed density criterion therefore implicitly specifies the dimensionality of the subspace clusters that the algorithm can detect. This makes parameterization difficult, and subspace clusters of varying dimensionality cannot be detected in a single run of the algorithm. However, once the fixed density criterion is dropped, the monotonicity property is lost as well and heuristic search strategies are required. SURFING and DUSC propose criteria that rate the interestingness of subspaces for clustering based on statistics of the data distribution, which allows subspace clusters of varying dimensionality to be detected. SCHISM also employs a variable density criterion and heuristic search, but uses a grid-based data quantization like CLIQUE.
All described algorithms apply bottom-up search,
which implies that at least parts of the subspace
clusters must be visible in the one-dimensional
subspaces.
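The effect of a fixed density criterion can be illustrated with a small calculation. The following Python sketch is illustrative only and not taken from the cited papers; the data size, cell width and threshold are hypothetical. It shows that, for uniformly distributed data under a CLIQUE-style grid quantization, the expected number of points per grid cell shrinks exponentially with the dimensionality of the subspace, so a fixed density threshold can only be exceeded up to a certain dimensionality.

# Illustrative sketch (not from the cited papers): with cell width w per
# dimension, the expected number of uniformly distributed points per cell
# shrinks by a factor of w for every added dimension, so a fixed density
# threshold implicitly bounds the dimensionality of detectable clusters.

def expected_points_per_cell(n, w, dims):
    """Expected points in one grid cell of a dims-dimensional subspace."""
    return n * (w ** dims)

n, w, threshold = 10000, 0.1, 5.0   # hypothetical data size, cell width, threshold
for dims in range(1, 7):
    e = expected_points_per_cell(n, w, dims)
    status = "dense" if e >= threshold else "not dense"
    print("%d-dimensional cell: expected %.3f points (%s)" % (dims, e, status))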
An alternative solution to the curse of dimen-
sionality in clustering is projected clustering. In
contrast to subspace clustering, algorithms for
projected clustering assign each data object to
only one distinct cluster and determine the best
subspace, or the best projection, for this cluster.
Instead of performing a bottom-up search, most
algorithms for projected clustering start in the full
dimensional space and, similar to K-means, itera-
tively optimize some objective function. The first
approach to projected clustering is the algorithm
PROCLUS (Projected Clustering) (Aggarwal et
al. 1999). Provided with the input parameters K and l, PROCLUS returns a partitioning of the data into K clusters having an average dimensionality l. Objects that do not fit well into any of the clusters
are assigned to noise. To achieve comparability
of distances among objects assigned to subspace
clusters of different subspace dimensionality, the
Manhattan distance is normalized by the subspace
dimensionality. In the initialization stage, PRO-
CLUS selects initial medoids as cluster representa-
tives from the data objects. In the iterative phase,
the set of medoids and their associated subspaces
are refined using a greedy hill-climbing technique.
After this iterative search, an additional pass over
the data is performed for refinement of clusters,
medoids and associated subspaces.
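The distance normalization mentioned above can be sketched as follows; the function and variable names are illustrative and not taken from the PROCLUS paper.

# Minimal sketch of a dimensionality-normalized Manhattan distance: the
# Manhattan distance restricted to a cluster's associated subspace,
# divided by the number of dimensions of that subspace.

def normalized_manhattan(x, y, subspace):
    """Manhattan distance over the given dimensions, divided by their number."""
    if not subspace:
        raise ValueError("subspace must contain at least one dimension")
    return sum(abs(x[d] - y[d]) for d in subspace) / len(subspace)

# Distances to medoids with subspaces of different dimensionality become comparable:
point    = [0.2, 0.9, 0.4, 0.7]
medoid_a = [0.1, 0.8, 0.5, 0.0]   # associated subspace: dimensions 0, 1, 2
medoid_b = [0.0, 0.0, 0.0, 0.6]   # associated subspace: dimension 3 only
print(normalized_manhattan(point, medoid_a, [0, 1, 2]))
print(normalized_manhattan(point, medoid_b, [3]))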
Procopiuc et al. (2002) introduce an alterna-
tive notion of projected clusters, called optimal
projective clusters, together with an algorithm
called DOC (Density-based Optimal Projective
Clustering) to find such clusters. An optimal pro-
jective cluster is defined using two parameters:
α specifying a fraction of the data objects and ω
specifying the width of a hypercube. An optimal projective cluster is defined as a set of points C associated with a set of dimensions D such that C contains at least a fraction α of the data objects and the projection of C onto the subspace spanned by D is contained in a hypercube of width ω. Based
on Monte-Carlo sampling, the algorithm DOC
returns approximations of the optimal projective
clusters. DOC does not require the user to specify
K, but the parameterization remains difficult
for the same reasons as discussed for subspace
clustering with a fixed density criterion. In ad-
dition, the cluster notion of DOC does not take the data distribution inside the hypercube representing a cluster into account. The hypercube does not always contain exactly one cluster; it may also contain several clusters, noise points and empty space.
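The Monte-Carlo idea behind DOC can be sketched as follows: a single trial samples a seed point and a small discriminating set, derives the dimensions in which the discriminating set stays within ω of the seed, and tests whether the resulting hypercube captures at least a fraction α of the data. This is a simplified sketch; the sampling sizes, the treatment of the hypercube width and the scoring of candidates are assumptions and do not follow the original algorithm exactly.

# Simplified sketch of one Monte-Carlo trial in the spirit of DOC; sampling
# sizes and the acceptance test are assumptions, not DOC's exact choices.
import random

def monte_carlo_trial(data, alpha, omega, discriminating_size=5):
    seed = random.choice(data)                          # random seed point
    sample = random.sample(data, discriminating_size)   # discriminating set
    dims = range(len(seed))
    # Dimensions in which all sampled points stay within omega of the seed
    # are treated as relevant for the candidate cluster.
    relevant = [d for d in dims
                if all(abs(p[d] - seed[d]) <= omega for p in sample)]
    if not relevant:
        return None
    # Candidate cluster: all points inside the hypercube around the seed
    # in the relevant dimensions.
    candidate = [p for p in data
                 if all(abs(p[d] - seed[d]) <= omega for d in relevant)]
    # Accept if the candidate contains at least a fraction alpha of all objects.
    if len(candidate) >= alpha * len(data):
        return candidate, relevant
    return None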
The algorithm PreDeCon (Subspace Preference Weighted Density-Connected Clustering) (Böhm
et al. 2004) partially avoids the problems of dif-
ficult parameterization and undesired behavior
in the presence of noise by extending the basic
concepts of density-based clustering to projected
clustering. In this approach, the notion of subspace