Opossum's results for this example were obtained with a 1.7 GHz Pentium 4 PC with 512 MB RAM in approximately 35 seconds (30 s file I/O, 2.5 s similarity computation, 0.5 s conversion to integer weighted graph, 0.5 s graph partitioning). Figure 3.4 shows the extended Jaccard similarity matrix (83% sparse) using Clusion in six scenarios: (a) original (randomly) ordered matrix, (b) seriated using Euclidean k-means, (c) using SOM, (d) using standard Jaccard k-means, (e) using extended Jaccard sample-balanced Opossum, and (f) using value-balanced Opossum clustering. Customer and revenue ranges are given beneath each image. In (a), (b), (c), and (d) clusters are neither compact nor balanced. In (e) and (f) clusters are much more compact, even though there is the additional constraint that they be balanced, based on an equal number of customers and equal revenue metrics, respectively. Beneath each Clusion visualization, the ranges of the number of customers and of total revenue among the 20 clusters are given to indicate balance.
We also experimented with minimum-distance agglomerative clustering, but this resulted in 19 singletons and one cluster with 2447 customers, so we did not bother including this approach. Clearly, k-means in the original feature space, the standard clustering algorithm, does not perform well (Fig. 3.4(b)). The SOM after 100,000 epochs performs slightly better (Fig. 3.4(c)) but is outperformed by the standard Jaccard k-means (Fig. 3.4(d)), which is adapted to similarity space by using −log(s^(J)) as distances for Clusion [3.13]. As Clusion shows, the relationship-based Opossum (Fig. 3.4(e),(f)) gives more compact (better separation of on- and off-diagonal regions) and well-balanced clusters compared to all other techniques. For example, looking at standard Jaccard k-means, the clusters contain between 48 and 597 customers, contributing between $608 and $70,443 to revenue.⁶ Thus the clusters may not be of comparable importance from a marketing standpoint. Moreover, the clusters are hardly compact: darkness is only slightly stronger in the on-diagonal regions in Fig. 3.4(d). All visualizations have been histogram-equalized for printing purposes. However, they are still much better observed by browsing interactively on a computer screen.
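To make the quantities in this discussion concrete, the following sketch computes the extended Jaccard similarity s^(J)(x_i, x_j) = x_i·x_j / (‖x_i‖² + ‖x_j‖² − x_i·x_j) for a toy spending matrix, converts it to distances via a negative log as described in the text, and permutes the matrix by cluster label, which is the basic seriation idea behind a Clusion-style plot. The function names and toy data are illustrative only; the actual Opossum system obtains its labels via balanced graph partitioning, which is not reproduced here.

```python
import numpy as np

def extended_jaccard(X):
    """Pairwise extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    G = X @ X.T                       # Gram matrix of dot products
    sq = np.diag(G)                   # squared norms of each row
    return G / (sq[:, None] + sq[None, :] - G)

def clusion_order(S, labels):
    """Permute a similarity matrix so same-cluster rows/columns are
    adjacent, yielding the dark on-diagonal blocks described in the text."""
    order = np.argsort(labels, kind="stable")
    return S[np.ix_(order, order)]

# Toy "market basket" matrix: 4 customers x 3 products (dollars spent).
X = np.array([[5.0, 0.0, 0.0],
              [4.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 4.0]])
S = extended_jaccard(X)
# Distances for similarity-space k-means; zero similarities are clipped
# before the log so the distance stays finite.
D = -np.log(np.clip(S, 1e-12, None))
labels = np.array([0, 0, 1, 1])       # hypothetical cluster assignment
S_seriated = clusion_order(S, labels)
```

With disjoint product baskets (e.g., customers 0 and 2) the similarity is exactly 0, which is why the clip before the logarithm is needed.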
A very compact and useful way of profiling a cluster is to look at its
most descriptive and most discriminative features. For market basket data,
this can be done by looking at a cluster's highest-revenue products and the
most unusual revenue drivers (e.g., products with the highest revenue lift).
Revenue lift is the ratio of the average spending on a product in a particular
cluster to the average spending in the entire data set.
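The revenue-lift definition above translates directly into code; this is a minimal sketch, assuming a customers-by-products spending matrix and hypothetical cluster labels (the function name and toy data are not from the original system):

```python
import numpy as np

def revenue_lift(X, labels, cluster):
    """Per-product revenue lift: average spending on each product within
    the given cluster divided by the average spending on that product
    over the entire data set (assumed nonzero for every product)."""
    in_cluster = X[labels == cluster].mean(axis=0)  # per-product cluster average
    overall = X.mean(axis=0)                        # per-product overall average
    return in_cluster / overall

# Toy data: 4 customers x 2 products (dollars spent).
X = np.array([[10.0, 0.0],
              [ 8.0, 2.0],
              [ 0.0, 6.0],
              [ 2.0, 8.0]])
labels = np.array([0, 0, 1, 1])
lift = revenue_lift(X, labels, cluster=0)
# Cluster 0 averages $9 on product 0 vs. a $5 overall average: lift 1.8.
```

The most descriptive products of a cluster are then those with the highest in-cluster average spending, while the most discriminative are those with the highest lift.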
In Table 3.1 the top three descriptive and discriminative products for the customers in the 20 value-balanced clusters are shown (see also Fig. 3.4(f)). Customers in cluster C2, for example, mostly spent their money on smoking-cessation gum ($10.15 on average). Interestingly, while this is a 35-fold average spending on smoking-cessation gum, these customers also spend 35
⁶ The solution for k-means depends on the initial choices for the means. A representative solution is shown here.