Opossum's results for this example were obtained with a 1.7 GHz Pentium 4 PC with 512 MB RAM in approximately 35 seconds (30 s file I/O, 2.5 s similarity computation, 0.5 s conversion to integer weighted graph, 0.5 s graph partitioning). Figure 3.4 shows the extended Jaccard similarity matrix (83% sparse) using Clusion in six scenarios: (a) original (randomly) ordered matrix, (b) seriated using Euclidean k-means, (c) using SOM, (d) using standard Jaccard k-means, (e) using extended Jaccard sample-balanced Opossum, and (f) using value-balanced Opossum clustering. Customer and revenue ranges are given beneath each image. In (a), (b), (c), and (d) clusters are neither compact nor balanced. In (e) and (f) clusters are much more compact, even though there is the additional constraint that they be balanced, based on an equal number of customers and equal revenue metrics, respectively. Beneath each Clusion visualization, the ranges of the number of customers and of total revenue among the 20 clusters are given to indicate balance.
We also experimented with minimum-distance agglomerative clustering, but this resulted in 19 singletons and one cluster with 2447 customers, so we did not bother including this approach. Clearly, k-means in the original feature space, the standard clustering algorithm, does not perform well (Fig. 3.4(b)). The SOM after 100,000 epochs performs slightly better (Fig. 3.4(c)) but is outperformed by the standard Jaccard k-means (Fig. 3.4(d)), which is adapted to similarity space by using −log(s^(J)) as distances for Clusion [3.13]. As Clusion shows, the relationship-based Opossum (Fig. 3.4(e),(f)) gives more compact (better separation of on- and off-diagonal regions) and well-balanced clusters compared to all other techniques. For example, looking at standard Jaccard k-means, the clusters contain between 48 and 597 customers, contributing between $608 and $70,443 to revenue.⁶ Thus the clusters may not be of comparable importance from a marketing standpoint. Moreover, the clusters are hardly compact: darkness is only slightly stronger in the on-diagonal regions in Fig. 3.4(d). All visualizations have been histogram-equalized for printing purposes. However, they are still much better observed by browsing interactively on a computer screen.
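To make the quantities in this discussion concrete, the following sketch computes the extended Jaccard similarity s^(J)(x_i, x_j) = x_i·x_j / (‖x_i‖² + ‖x_j‖² − x_i·x_j) for a toy spending matrix, converts it to distances via a negative log as described in the text, and permutes the matrix by cluster label, which is the basic seriation idea behind a Clusion-style plot. The function names and toy data are illustrative only; the actual Opossum system obtains its labels via balanced graph partitioning, which is not reproduced here.

```python
import numpy as np

def extended_jaccard(X):
    """Pairwise extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    G = X @ X.T                       # Gram matrix of dot products
    sq = np.diag(G)                   # squared norms of each row
    return G / (sq[:, None] + sq[None, :] - G)

def clusion_order(S, labels):
    """Permute a similarity matrix so same-cluster rows/columns are
    adjacent, yielding the dark on-diagonal blocks described in the text."""
    order = np.argsort(labels, kind="stable")
    return S[np.ix_(order, order)]

# Toy "market basket" matrix: 4 customers x 3 products (dollars spent).
X = np.array([[5.0, 0.0, 0.0],
              [4.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 4.0]])
S = extended_jaccard(X)
# Distances for similarity-space k-means; zero similarities are clipped
# before the log so the distance stays finite.
D = -np.log(np.clip(S, 1e-12, None))
labels = np.array([0, 0, 1, 1])       # hypothetical cluster assignment
S_seriated = clusion_order(S, labels)
```

With disjoint product baskets (e.g., customers 0 and 2) the similarity is exactly 0, which is why the clip before the logarithm is needed.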
A very compact and useful way of profiling a cluster is to look at its
most descriptive and most discriminative features. For market basket data,
this can be done by looking at a cluster's highest-revenue products and the
most unusual revenue drivers (e.g., products with the highest revenue lift).
Revenue lift is the ratio of the average spending on a product in a particular
cluster to the average spending in the entire data set.
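The revenue-lift definition above translates directly into code; this is a minimal sketch, assuming a customers-by-products spending matrix and hypothetical cluster labels (the function name and toy data are not from the original system):

```python
import numpy as np

def revenue_lift(X, labels, cluster):
    """Per-product revenue lift: average spending on each product within
    the given cluster divided by the average spending on that product
    over the entire data set (assumed nonzero for every product)."""
    in_cluster = X[labels == cluster].mean(axis=0)  # per-product cluster average
    overall = X.mean(axis=0)                        # per-product overall average
    return in_cluster / overall

# Toy data: 4 customers x 2 products (dollars spent).
X = np.array([[10.0, 0.0],
              [ 8.0, 2.0],
              [ 0.0, 6.0],
              [ 2.0, 8.0]])
labels = np.array([0, 0, 1, 1])
lift = revenue_lift(X, labels, cluster=0)
# Cluster 0 averages $9 on product 0 vs. a $5 overall average: lift 1.8.
```

The most descriptive products of a cluster are then those with the highest in-cluster average spending, while the most discriminative are those with the highest lift.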
In Table 3.1 the top three descriptive and discriminative products for the customers in the 20 value-balanced clusters are shown (see also Fig. 3.4(f)). Customers in cluster C2, for example, mostly spent their money on smoking-cessation gum ($10.15 on average). Interestingly, while this is a 35-fold average spending on smoking-cessation gum, these customers also spend 35
⁶ The solution for k-means depends on the initial choices for the means. A representative solution is shown here.