adaptation capability with spectacular results for two-dimensional data, but
its effectiveness and interpretability for higher-dimensional data have not
been reported. In [3.38], a hypergraph clustering approach was taken for
clustering highly related items defined in high-dimensional space and for
generating the corresponding association rules. This method was applied to
binarized data,
with each frequent item set being represented by a weighted hyperedge. Like
our method, it is suitable for high-dimensional data and is linear in d.
Subsequently, this and another graph partitioning algorithm, called principal
direction divisive partitioning, were applied to Web document categorization
[3.39]. These two algorithms are the closest in spirit to our approach.
Finally, spectral partitioning methods [3.17], [3.40] can be applied to similarity
graphs. A probabilistic foundation for spectral methods for clustering and
segmentation has been recently proposed [3.41].
Related work on scalability issues of clustering is discussed in Section
3.6.2. Visualization of high-dimensional data clusters can be largely divided
into three popular approaches:
1. Dimensionality reduction by selection of two or three dimensions, or,
more generally, projecting the data down to two or three dimensions.
Often these dimensions correspond to principal components or a scal-
able approximation thereof. Another noteworthy method is CViz [3.42],
which projects onto the plane that passes through three selected cluster
centroids to yield a “discrimination optimal” two-dimensional projection.
These projections are useful for a medium number of dimensions, i.e., if
d is not too large (d < 100).8 Nonlinear projections have also been studied
[3.43]. Re-creating a two- or three-dimensional space from a similarity
graph can also be done through multidimensional scaling [3.44].
2. Parallel-axis plots show each object as a line along d parallel axes. How-
ever, this technique is rendered ineffective if the number of dimensions d
or the number of objects gets too high.
3. Kohonen's self-organizing map (SOM) [3.45] provides an innovative and
powerful way of clustering while enforcing constraints on a logical topol-
ogy imposed on the cluster centers. If this topology is two-dimensional,
one can readily “visualize” the clustering of data. Essentially a two-
dimensional manifold is mapped onto the (typically higher-dimensional)
feature space, trying to approximate data density while maintaining topo-
logical constraints. Because the mapping is not bijective, the quality can
degrade very rapidly with increasing dimensionality of the feature space,
unless the data are largely confined to a much lower-order manifold within
this space [3.43]. Multidimensional scaling (MDS) and associated meth-
ods also face similar issues.
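The principal-component projection behind approach 1 can be sketched in a few lines of NumPy. This is a generic sketch, not the CViz method of [3.42]; the data, dimensionality, and function name are illustrative:

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)              # center the data
    # rows of Vt are the principal directions (SVD of the centered data)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                 # (n, k) coordinates for plotting

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 200 points in d = 50 dimensions
Y = pca_project(X, k=2)                  # two-dimensional view of the data
```

The two columns of Y are the coordinates one would scatter-plot; by construction the first column captures at least as much variance as the second.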
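The coordinate preparation behind a parallel-axis plot (approach 2) amounts to rescaling each of the d dimensions to a common range, so that every object becomes one polyline across the d axes. A minimal sketch, with illustrative array sizes:

```python
import numpy as np

def parallel_axis_coords(X):
    """Min-max scale each dimension to [0, 1] so that all d parallel
    axes share one vertical range; each row becomes one polyline."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X - lo) / span

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))                  # 5 objects, d = 4 axes
P = parallel_axis_coords(X)
# row i of P gives the heights of object i's polyline at the 4 axes
```

Drawing one line per row is then straightforward; the clutter the text mentions arises because the number of polylines equals the number of objects and the number of axes equals d.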
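Approach 3 can be illustrated with a minimal online SOM in NumPy. This is a generic sketch of Kohonen's algorithm, not the specific implementation of [3.45]; the grid size, learning-rate schedule, and Gaussian neighborhood are illustrative assumptions:

```python
import numpy as np

def train_som(X, grid=(5, 5), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal online SOM: a lattice of code vectors is pulled toward
    the data while lattice neighbors move together, enforcing the 2-D
    logical topology on the cluster centers."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    gx, gy = grid
    W = rng.normal(size=(gx * gy, d))    # code vectors in feature space
    # lattice coordinates of each unit, used for the neighborhood term
    coords = np.array([(i, j) for i in range(gx) for j in range(gy)], float)
    T, t = epochs * n, 0
    for _ in range(epochs):
        for x in X[rng.permutation(n)]:
            lr = lr0 * (1 - t / T)                        # decaying rate
            sigma = sigma0 * (1 - t / T) + 0.5            # shrinking radius
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-dist2 / (2 * sigma ** 2))         # lattice neighborhood
            W += lr * h[:, None] * (x - W)
            t += 1
    return W, coords

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.2, (100, 2)),
               rng.normal(5, 0.2, (100, 2))])             # two tight clusters
W, coords = train_som(X)
# quantization error: mean distance from each point to its best unit
err = np.mean([np.min(np.linalg.norm(W - x, axis=1)) for x in X])
```

To "visualize" a clustering as described above, each data point is placed at the lattice coordinates of its best-matching unit, giving a two-dimensional map of the high-dimensional data.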
8 For text mining, linearly projecting down to about 20 to 50 dimensions does not
affect results much (e.g., latent semantic indexing). However, this is still too
high to visualize. A projection to lower dimensions leads to substantial
degradation, and three-dimensional projections are of very limited utility.