In summary, the challenges mainly arise from two aspects:³ (i) large numbers of data samples, n; and (ii) each sample having a large number of attributes or features (dimensions, d), which cannot be readily simplified without much information loss.
The first aspect can be dealt with by subsampling the data, exploiting summary statistics, aggregating or “rolling up” to consider data at a coarser resolution, or using approximating heuristics that reduce computation time at the cost of some loss in quality. See Chapter 8 [3.8] for several examples of such approaches. This chapter provides a way of addressing the second aspect by describing an alternative way of clustering and visualization when, even after feature reduction, one is left with hundreds of dimensions per object (and further reduction will significantly degrade the results), and moreover, simplifying data modeling assumptions are also not valid. Because clustering basically involves grouping objects based on their interrelationships or similarities, one can alternatively work in similarity space instead of the original feature space. The key insight in this work is that if one can find a similarity measure (derived from the object features) that is appropriate for the problem domain, then a single number can capture the essential “closeness” of a given pair of objects, and any further analysis can be based only on these numbers. This can be of great benefit when the data are very high-dimensional and simplifications such as further reducing dimensionality through projections or assuming conditional independence of features are not appropriate. Indeed, several researchers have recently proposed similarity-based clustering techniques for data mining applications [3.7], [3.9], [3.10], and this is emerging as an active area of research.
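For concreteness, the following is a minimal sketch of this idea (not code from the chapter): assuming cosine similarity is an appropriate measure for the domain, an n × d feature matrix is reduced to an n × n similarity matrix, and every subsequent step can operate on those pairwise numbers alone. The data sizes and the choice of cosine similarity are illustrative assumptions.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities for an (n, d) feature matrix X.

    Entry S[i, j] is the single number that summarizes the closeness of
    objects i and j; later analysis needs only S, not the d features.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0             # guard against all-zero rows
    Xn = X / norms
    return Xn @ Xn.T                      # S[i, j] lies in [-1, 1]

# Illustrative sizes: 1000 objects, each with 500 features.
rng = np.random.default_rng(0)
X = rng.random((1000, 500))
S = cosine_similarity_matrix(X)           # all further analysis can use S alone
```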
The similarity space also lends itself to a simple technique to visualize the clustering results. A major contribution of this chapter is to demonstrate that this technique has increased power when the clustering method used contains ordering information (e.g., top-down). Popular clustering methods in feature space are either nonhierarchical (as in k-means) or bottom-up (agglomerative clustering). However, if one transforms the clustering problem into a related problem of partitioning a similarity graph, several powerful partitioning methods with ordering properties can be applied. Moreover, the overall framework is quite generally applicable if one can determine the appropriate similarity measure for a given situation.
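To make the graph view concrete, the sketch below treats a pairwise similarity matrix as a weighted graph and partitions it top-down by recursive bisection. The chapter's own method is a multilevel graph-partitioning algorithm [3.11]; plain spectral bisection and the toy data here are stand-in assumptions, meant only to illustrate how a sequence of top-down splits carries ordering information that bottom-up methods lack.

```python
import numpy as np

def spectral_bisect(S, idx):
    """Split the objects in idx into two halves using the Fiedler vector
    (second-smallest eigenvector of the graph Laplacian) of the subgraph."""
    W = S[np.ix_(idx, idx)]
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvectors, ascending eigenvalues
    order = np.argsort(vecs[:, 1])        # order objects along the Fiedler vector
    half = len(idx) // 2
    return idx[order[:half]], idx[order[half:]]

def recursive_partition(S, k):
    """Top-down partitioning into k groups; the sequence of splits itself
    yields a coarse-to-fine ordering of the objects."""
    parts = [np.arange(S.shape[0])]
    while len(parts) < k:
        biggest = max(range(len(parts)), key=lambda i: len(parts[i]))
        left, right = spectral_bisect(S, parts.pop(biggest))
        parts.extend([left, right])
    return parts

# Toy similarity matrix built from random features (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((200, 50))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T

labels = np.empty(S.shape[0], dtype=int)
for c, members in enumerate(recursive_partition(S, k=4)):
    labels[members] = c                   # cluster label per object
```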
We begin by considering domain-specific transformations into similarity space in Section 3.2. Section 3.3 describes a specific clustering technique for transaction data (Opossum), based on a multilevel graph-partitioning algorithm [3.11]. In Section 3.4, we describe a simple but effective visualization technique applicable to similarity spaces (Clusion). Clustering and visualization results are presented in Section 3.5. In Section 3.6, we consider system
³ A third issue, of how to deal with seasonality and other temporal variations in the data, is also critical in some applications. This aspect is not within the scope of this chapter, but see [3.4] for a solution for retail data.