Clustering and Visualization of Retail Market Baskets - Advanced Techniques in Knowledge Discovery and Data Mining


customer data. Filtering these outliers may not be easy or desirable because

they could be very important (e.g., major revenue contributors). In addition,

features are often neither nominal nor continuous, but they may have discrete

positive ordinal attribute values, with a strongly non-Gaussian distribution.

One way to reduce the feature space is to consider only the most dominant

products (attribute selection), but in practice this may still leave hundreds of

products to be considered. And because product popularity tends to follow a

Zipf distribution [3.5], the tail is “heavy,” meaning that revenue contribution

from the less-popular products is significant for certain customers. Moreover,

in retail, the higher profit margins are often associated with less-popular

products. One can do a “roll-up” to reduce the number of products, but

with a corresponding loss in resolution or granularity. Feature extraction

or transformation is typically not carried out, as derived features lose the

semantics of the original ones as well as the sparsity property.
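The cost of attribute selection under a heavy tail can be illustrated with a small sketch. The customers, products, and revenue figures below are hypothetical toy data, not taken from the source, and the helper names `top_k_products` and `revenue_lost` are introduced only for this illustration. The code keeps the k most widely bought products and reports the fraction of each customer's revenue that falls outside that selection.

```python
from collections import Counter

# Hypothetical toy data (assumption): customer -> {product: revenue}.
baskets = {
    "c1": {"milk": 4.0, "bread": 3.0},
    "c2": {"milk": 2.0, "bread": 2.0, "eggs": 3.0},
    "c3": {"milk": 3.0, "truffle_oil": 45.0},
    "c4": {"bread": 2.0, "eggs": 2.0},
}


def top_k_products(baskets, k):
    """Return the k products bought by the largest number of customers."""
    popularity = Counter(p for items in baskets.values() for p in items)
    # Sort by descending count, then by name, so ties break deterministically.
    ranked = sorted(popularity, key=lambda p: (-popularity[p], p))
    return set(ranked[:k])


def revenue_lost(baskets, kept):
    """Fraction of each customer's revenue that lies outside `kept`."""
    lost = {}
    for customer, items in baskets.items():
        total = sum(items.values())
        dropped = sum(v for p, v in items.items() if p not in kept)
        lost[customer] = dropped / total
    return lost


kept = top_k_products(baskets, k=2)
lost = revenue_lost(baskets, kept)
for customer in sorted(lost):
    print(customer, round(lost[customer], 3))
```

In this toy setting the two most popular products cover all of one customer's spending but only a small part of the spending of the customer who buys a single expensive, unpopular item, which is the situation the heavy-tail argument describes.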

The alternative to attribute reduction is to try “simplification via modeling.”

One approach would be to consider only binary features (bought or not).

This reduces each transaction to an unordered set of the purchased products.

Thus one can use techniques such as the Apriori algorithm to determine

associations or rules. In fact, this is currently the most popular approach to

market basket analysis (see chap. 8 [3.6]). Unfortunately, this results in loss of

vital information: one cannot differentiate between buying one gallon of milk

and 100 gallons of milk, nor can one weigh the importance of buying an

apple versus buying a car, though clearly these are very different situations

from a business perspective. In general, association-based rules derived from

such sets will be inferior when revenue or profits are the primary performance

indicators, because the simplified data representation loses information about

quantity, price, or margins. The other broad class of modeling simplifications

for market basket analysis is based on taking a macro-level view of the data,

whose characteristics are captured in a small number of parameters. In retail, a

5-dimensional model for customers composed from indicators for recency,

frequency, monetary value, volume, and tenure (RFMVT) is popular. However,

this useful model is at a much lower resolution than looking at individual

products and fails to capture actual purchasing behavior in more complex

ways such as taste/brand preferences or price sensitivity.
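The information loss caused by the binary (bought or not) representation can be made concrete with a minimal sketch. The two transactions, their quantities, and their unit prices are hypothetical values chosen for illustration, and `to_item_set` and `revenue` are helper names assumed here. Both transactions reduce to the same item set, although their revenue differs by roughly two orders of magnitude.

```python
# Hypothetical toy transactions (assumption): product -> (quantity, unit_price).
t1 = {"milk": (1, 3.0), "apple": (1, 0.5)}
t2 = {"milk": (100, 3.0), "apple": (1, 0.5)}


def to_item_set(transaction):
    """Binary view: keep only which products were bought."""
    return frozenset(transaction)


def revenue(transaction):
    """Revenue view: quantity times unit price, summed over products."""
    return sum(qty * price for qty, price in transaction.values())


same_items = to_item_set(t1) == to_item_set(t2)
r1, r2 = revenue(t1), revenue(t2)

print("identical as item sets:", same_items)
print("revenue:", r1, "vs", r2)
```

Any rule mined from the item sets alone treats these two transactions as the same observation, so it cannot reflect the difference in quantity or revenue.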

Due to these characteristics, it is not surprising that traditional metric

vector space-based clustering techniques work poorly on real-life market basket

data. For example, a typical result of hierarchical agglomerative clustering

(both single-link and complete-link approaches) on market basket data is

one huge cluster near the origin, because most customers buy very

few items,² and a few scattered clusters otherwise. Applying k-means could

forcibly split this huge cluster into segments depending on the initialization,

but not in a meaningful manner.
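A small numeric sketch suggests why metric vector-space clustering behaves this way. The three customers and their purchase counts are hypothetical, and plain Euclidean distance is assumed as the metric. Customers A and B share no products but both buy very little, while A and C buy the same product in different volumes.

```python
import math

# Hypothetical purchase-count vectors (assumption) over [milk, bread, eggs].
customers = {
    "A": [1, 0, 0],   # one unit of milk
    "B": [0, 1, 0],   # one unit of bread, nothing in common with A
    "C": [10, 0, 0],  # same product as A, larger volume
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d_ab = euclidean(customers["A"], customers["B"])
d_ac = euclidean(customers["A"], customers["C"])

print("d(A, B) =", round(d_ab, 3))
print("d(A, C) =", round(d_ac, 3))
```

Under this metric A lies much closer to B than to C, so small-basket customers with unrelated purchases end up grouped together near the origin, while customers with similar tastes but larger volumes are pushed apart.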

² This is the dilution effect described in [3.7].
