such as k-means and k-medoids, bottom-up hierarchical approaches such
as single-link or complete-link agglomerative clustering, soft partitioning ap-
proaches such as fuzzy clustering, EM-based techniques, and methods moti-
vated by statistical mechanics [3.32]. Although several methods of clustering
data defined by pairwise (dis)similarities are available [3.33], most classical
techniques, as well as recent techniques proposed in the data mining com-
munity (CLARANS, DBScan, BIRCH, CLIQUE, CURE, WaveCluster etc.
[3.34]), are based on distances between the samples in the original feature
space. The emphasis of the data mining-oriented proposals is primarily on
an e cient and scalable (with respect to number of records) implementation
of approximate k-means, k-medioids, or local density estimation. Thus they
are all faced with the “curse of dimensionality” [3.35] and the associated
sparsity issues when dealing with very high-dimensional data. Essentially,
the amount of data to sustain a given spatial density increases exponentially
with the dimensionality of the input space, or alternatively, the sparsity in-
creases exponentially given a constant amount of data, with points tending to
become equidistant from one another. In general, this will adversely impact
any method based on spatial density, unless the data follow certain simple
distributions as described in the introduction. Certain other limitations of
popular clustering methods are nicely illustrated in [3.9]. In [3.36], the au-
thors recognize that one way of tackling high-dimensional data is to change
the distance function in an application-specific way. They suggest some pos-
sible modified functions and principles but do not provide any experimental
results.
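The distance-concentration effect can be made concrete with a short illustrative sketch (Python with NumPy; the sample size and dimensionalities are arbitrary choices, not taken from the cited works). With the number of samples held fixed, the ratio of the largest to the smallest pairwise distance shrinks toward one as the dimensionality grows, i.e., the points become nearly equidistant:

```python
# Illustrative sketch: distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # fixed number of samples

for d in (2, 10, 100, 1000):
    x = rng.random((n, d))                       # uniform points in [0, 1]^d
    # squared pairwise distances via |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (x ** 2).sum(1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])  # unique pairs only
    print(f"d={d:5d}  max/min pairwise distance ratio = "
          f"{dist.max() / dist.min():.2f}")
```

As d grows the printed ratio approaches 1, which is precisely why density-based notions of "close" and "far" lose their discriminating power.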
In databases, where clustering is often tied to the need for efficient indexing,
a variety of space-partitioning methods (such as KDB-trees) and data-partitioning
methods (e.g., R-trees and variants) exist. These methods are typically
tractable for up to 10- to 15-dimensional data, and by a judicious hybrid of
these two approaches, data with tens of attributes may be partitioned [3.37].
Significant overlaps among the hyperrectangles and the occurrence of several
empty areas become increasingly problematic if the dimensionality is further
increased (see [3.37] for more details).
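The degradation can be illustrated with a hypothetical sketch (Python with NumPy; this mimics the effect rather than any actual R-tree or KDB-tree algorithm, and the parameters are arbitrary). Data are split into 2^k pages by median cuts on the first k dimensions, roughly what such an index can afford, and we count how many page hyperrectangles a nearest-neighbor search ball must inspect. In high dimensions the bounding boxes span nearly the full data extent along every unsplit axis, so almost every page is touched:

```python
# Hypothetical sketch: page hyperrectangles vs. a nearest-neighbor query.
import numpy as np

rng = np.random.default_rng(1)
n, k = 2048, 4                                  # 2**k = 16 pages

for d in (4, 16, 128):
    x = rng.random((n, d))
    # assign each point to a page via median splits on dimensions 0..k-1
    page = np.zeros(n, dtype=int)
    for j in range(k):
        page = 2 * page + (x[:, j] > np.median(x[:, j]))
    q = rng.random(d)                           # random query point
    r = np.sqrt(((x - q) ** 2).sum(1)).min()    # exact NN distance
    touched = 0
    for p in range(2 ** k):
        pts = x[page == p]
        lo, hi = pts.min(0), pts.max(0)         # the page's bounding box
        gap = np.maximum(np.maximum(lo - q, q - hi), 0.0)
        if (gap ** 2).sum() <= r ** 2:          # does the ball meet the box?
            touched += 1
    print(f"d={d:4d}  pages intersecting the NN search ball: "
          f"{touched}/{2 ** k}")
```

At low d only a handful of pages intersect the search ball; at high d virtually all of them do, so the index degenerates to a scan.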
Graph-theoretic clustering has been known for some time [3.14], though it is
not commonly applied. Lately, however, such approaches have proved attractive
for gene expression analysis [3.27]. Graph-based methods have also emerged in the
data mining literature to tackle high-dimensional data analysis. ROCK (Ro-
bust Clustering using linKs) [3.7] is an agglomerative hierarchical clustering
technique for categorical attributes. It uses the binary Jaccard coefficient
and a thresholding criterion to establish links between samples. Common
neighbors are used to define interconnectivity of clusters, which is used to
merge clusters. CHAMELEON [3.9] starts with partitioning the data into
a large number of clusters by partitioning the v-nearest neighbor graph. In
the subsequent stage clusters are merged based on relative interconnectivity
and relative closeness measures. These localized measures lead to a dynamic
adaptation of the merging process to the internal characteristics of the
clusters being combined.
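ROCK's neighbor and link computation, described above, can be sketched as follows (Python with NumPy; the threshold value and the toy binary data are illustrative assumptions, not values from [3.7]):

```python
# Illustrative sketch of ROCK-style links: samples are neighbors if their
# binary Jaccard coefficient passes a threshold; link(i, j) counts their
# common neighbors and drives cluster merging.
import numpy as np

def jaccard(a, b):
    """Binary Jaccard coefficient: |a AND b| / |a OR b|."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

x = np.array([[1, 1, 0, 0],                     # toy binary transactions
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=bool)
theta = 0.4                                     # illustrative threshold
n = len(x)

# adjacency: i and j are neighbors if their similarity passes the threshold
adj = np.array([[i != j and jaccard(x[i], x[j]) >= theta
                 for j in range(n)] for i in range(n)])

# link(i, j) = number of common neighbors of samples i and j
links = adj.astype(int) @ adj.astype(int)
print(links)
```

Here samples 0 and 2 are not direct neighbors yet share a common neighbor (sample 1), so they acquire a nonzero link count; it is such link counts, rather than raw pairwise similarity, that ROCK's merging criterion maximizes.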