customer data. Filtering these outliers may not be easy or desirable because
they could be very important (e.g., major revenue contributors). In addition,
features are often neither nominal nor continuous, but they may have discrete
positive ordinal attribute values, with a strongly non-Gaussian distribution.
One way to reduce the feature space is to consider only the most dominant
products (attribute selection), but in practice this may still leave hundreds of
products to be considered. And because product popularity tends to follow a
Zipf distribution [3.5], the tail is “heavy,” meaning that revenue contribution
from the less-popular products is significant for certain customers. Moreover,
in retail, the higher profit margins are often associated with less-popular
products. One can do a “roll-up” to reduce the number of products, but
with a corresponding loss in resolution or granularity. Feature extraction
or transformation is typically not carried out, as derived features lose the
semantics of the original ones as well as the sparsity property.
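To make the heavy-tail point concrete, the following is a minimal sketch that assumes purchase counts are exactly proportional to 1/rank (a Zipf law with exponent 1). The catalog size, the cutoff, and the function name zipf_tail_share are illustrative assumptions, not values from the text, and the result describes share of purchases rather than revenue.

```python
def zipf_tail_share(num_products: int, top_k: int, exponent: float = 1.0) -> float:
    """Fraction of total purchases held by products ranked below the top_k.

    Assumes (hypothetically) that the product at popularity rank r is bought
    with weight 1 / r**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, num_products + 1)]
    return sum(weights[top_k:]) / sum(weights)


# Keep only the 100 most popular of 10,000 products (illustrative numbers).
share = zipf_tail_share(num_products=10_000, top_k=100)
print(f"share of purchases outside the top 100: {share:.2f}")
```

Under these assumptions, slightly less than half of all purchases fall outside the top 100 products, which is why attribute selection alone discards a significant part of the signal.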
The alternative to attribute reduction is to try “simplification via
modeling.” One approach would be to consider only binary features (bought or
not). This reduces each transaction to an unordered set of the purchased
products. Thus one can use techniques such as the Apriori algorithm to
determine associations or rules. In fact, this is currently the most popular
approach to market basket analysis (see chap. 8 [3.6]). Unfortunately, this
results in loss of vital information: one cannot differentiate between buying
one gallon of milk and 100 gallons of milk, nor can one weigh the relative
importance of buying an apple versus buying a car, though clearly these are
very different situations from a business perspective. In general,
association-based rules derived from
such sets will be inferior when revenue or profits are the primary performance
indicators, because the simplified data representation loses information about
quantity, price, or margins. The other broad class of modeling simplifications
for market basket analysis is based on taking a macro-level view of the data,
whose characteristics can be captured in a small number of parameters. In
retail, a 5-dimensional model for customers composed of indicators for
recency, frequency, monetary value, volume, and tenure (RFMVT) is popular.
However, this useful model is at a much lower resolution than looking at
individual products, and it fails to capture more complex aspects of actual
purchasing behavior, such as taste/brand preferences or price sensitivity.
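Both simplifications can be illustrated on the same toy data. The sketch below is hypothetical throughout: the transactions are invented, and the specific RFMVT definitions (recency as days since the last purchase, frequency as the number of distinct purchase days, tenure as days since the first purchase) are assumptions chosen for illustration, since the text names the five indicators without defining them.

```python
from datetime import date

# Hypothetical transactions: (customer, day, product, quantity, unit_price)
TRANSACTIONS = [
    ("alice", date(2024, 1, 5), "milk", 1, 3.0),
    ("alice", date(2024, 3, 1), "apple", 2, 0.5),
    ("bob", date(2024, 1, 5), "milk", 100, 3.0),
    ("bob", date(2024, 3, 1), "apple", 2, 0.5),
]


def binary_baskets(transactions):
    """Binary view: the unordered set of products each customer bought."""
    baskets = {}
    for customer, _day, product, _qty, _price in transactions:
        baskets.setdefault(customer, set()).add(product)
    return baskets


def rfmvt(transactions, today):
    """Macro-level view: one five-number summary per customer."""
    rows_by_customer = {}
    for customer, day, _product, qty, price in transactions:
        rows_by_customer.setdefault(customer, []).append((day, qty, price))
    summary = {}
    for customer, rows in rows_by_customer.items():
        days = [day for day, _qty, _price in rows]
        summary[customer] = {
            "recency_days": (today - max(days)).days,
            "frequency": len(set(days)),
            "monetary": sum(qty * price for _day, qty, price in rows),
            "volume": sum(qty for _day, qty, _price in rows),
            "tenure_days": (today - min(days)).days,
        }
    return summary


baskets = binary_baskets(TRANSACTIONS)
profiles = rfmvt(TRANSACTIONS, today=date(2024, 4, 1))

# The binary view cannot tell the two customers apart ...
print(baskets["alice"] == baskets["bob"])
# ... while the RFMVT view separates them by spend but drops the products.
print(profiles["alice"]["monetary"], profiles["bob"]["monetary"])
```

In this sketch the binary baskets of the two customers are identical even though one bought 100 times as much milk, while the RFMVT profiles differ in monetary value and volume but no longer record which products were bought.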
Due to these characteristics, it is not surprising that traditional metric
vector space-based clustering techniques work poorly on real-life market
basket data. For example, a typical result of hierarchical agglomerative
clustering (both single-link and complete-link approaches) on market basket
data is one huge cluster near the origin, because most customers buy very few
items,² and a few scattered clusters otherwise. Applying k-means could
forcibly split this huge cluster into segments depending on the
initialization, but not in a meaningful manner.
² This is the dilution effect described in [3.7].
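The pull toward the origin can be seen with plain Euclidean distances. The sketch below uses invented quantity vectors over four hypothetical products. It only illustrates the mechanism stated above (customers who buy very few items all sit close to the origin) and is not taken from [3.7].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length quantity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical purchase quantities over four products.
light_a = [1, 0, 0, 0]    # one unit of product 0
light_b = [0, 0, 0, 1]    # one unit of product 3, nothing in common with light_a
heavy_a = [20, 10, 0, 0]  # heavy buyer of products 0 and 1
heavy_b = [30, 15, 0, 0]  # same product mix as heavy_a at 1.5 times the volume

d_light = euclidean(light_a, light_b)  # disjoint baskets, both near the origin
d_heavy = euclidean(heavy_a, heavy_b)  # identical tastes, different volume

print(d_light < d_heavy)
```

Here the two light buyers share no products yet are much closer to each other than the two heavy buyers with identical tastes, so a distance-based method groups customers by how little they buy rather than by what they buy.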