Clustering and Visualization of Retail Market Baskets - Advanced Techniques in Knowledge Discovery and Data Mining

Note that the majority category, purity, and entropy are available only where a supervised categorization is given. Of course, the categorization cannot be used to tune the clustering. Clusters 1 and 2 contain only documents from the Health category, so they are highly related. The fourth cluster, which is indicated to be weak by Clusion, in fact has the lowest purity in the group, with 38% of documents from the most dominant category (film). Clusion also suggests that cluster 3 is not only strong, as indicated by the dark diagonal region, but that it also has distinctly above-average relationships to the other four clusters. On inspecting the word stems typifying this cluster (Apple, Intel, and electron(ics)), it is apparent that this is because of the interdisciplinary appearance of technology-savvy words in recent news releases. Because such cluster descriptions might not be so easily available or well understood in all domains, the intuitive display of Clusion is very useful.
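As a rough illustration of the three supervised measures mentioned above, the following sketch computes the majority category, purity, and entropy of a single cluster from the category labels of its members. The function name cluster_quality and the toy labels are hypothetical, and dividing the entropy by the logarithm of the number of categories (so that it lies between 0 and 1) is an assumption made here, not something this section specifies.

```python
import math
from collections import Counter


def cluster_quality(categories, num_categories):
    """Return (majority category, purity, normalized entropy) for one cluster.

    categories: list of category labels, one per object in the cluster.
    num_categories: total number of categories in the whole data set
        (assumed normalizer for the entropy).
    """
    counts = Counter(categories)
    n = len(categories)
    majority, majority_count = counts.most_common(1)[0]
    # Purity: fraction of the cluster that belongs to its largest category.
    purity = majority_count / n
    # Shannon entropy of the category distribution inside the cluster.
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    # Assumed normalization to [0, 1].
    if num_categories > 1:
        entropy /= math.log(num_categories)
    return majority, purity, entropy


# Toy labels (made up for illustration, not taken from the experiment).
pure_cluster = ["health"] * 5
mixed_cluster = ["film", "film", "health", "tech"]

print(cluster_quality(pure_cluster, 3))
print(cluster_quality(mixed_cluster, 3))
```

A cluster drawn entirely from one category has purity 1 and entropy 0, whereas a cluster spread over several categories has lower purity and higher entropy.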

Clusion has several other powerful properties. For example, it can be integrated with product hierarchies (metadata) to provide simultaneous customer and product clustering, as well as multilevel views and summaries. It also has a graphical user interface so one can interactively browse, split, or merge a data set, which helps speed up the iterations of analysis during a data-mining project.
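The display discussed above can be approximated in a few lines. The sketch below assumes that the display is built from a pairwise similarity matrix whose rows and columns are permuted so that members of the same cluster are adjacent; a strong cluster then shows up as a diagonal block of high similarity, and the off-diagonal blocks show how clusters relate to each other. The helper names reorder_by_cluster and block_means and the toy matrix are hypothetical, and this is not the Clusion implementation itself.

```python
def reorder_by_cluster(similarity, labels):
    """Permute a square similarity matrix so that clusters are contiguous."""
    # sorted() is stable, so the original order is kept within each cluster.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    permuted = [[similarity[i][j] for j in order] for i in order]
    return permuted, order


def block_means(similarity, labels):
    """Average similarity for every ordered pair of clusters.

    Self-similarities (i == j) are excluded from the averages.
    """
    clusters = sorted(set(labels))
    members = {c: [i for i, lab in enumerate(labels) if lab == c]
               for c in clusters}
    means = {}
    for a in clusters:
        for b in clusters:
            pairs = [(i, j) for i in members[a] for j in members[b] if i != j]
            if pairs:
                total = sum(similarity[i][j] for i, j in pairs)
                means[(a, b)] = total / len(pairs)
            else:
                # A singleton cluster has no within-cluster pairs.
                means[(a, b)] = float("nan")
    return means


# Toy similarity matrix for four objects (made up for illustration).
S = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
]
labels = [0, 1, 0, 1]

permuted, order = reorder_by_cluster(S, labels)
means = block_means(S, labels)
```

In this toy example the within-cluster averages are much higher than the between-cluster averages, which corresponds to dark diagonal blocks on a light background in the permuted matrix.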

3.5 Experiments

3.5.1 Retail Market Basket Clusters

Let us illustrate clustering in a real retail transaction database of 21,672 customers of a drugstore [5]. For the illustrative purpose of this chapter, we randomly selected 2500 customers. The total number of transactions (cash-register scans) for these customers is 33,814 over three months. We rolled up the product hierarchy once to obtain 1236 different products purchased. Fifteen percent of the total revenue is contributed by the single item Financial-Depts (on-site financial services such as check cashing and bill payment), which was removed because it was too common. Of the remaining 1235 products, 473 accounted for less than $25 each in total and were dropped. The remaining n = 2466 customers (34 customers had empty baskets after removing the irrelevant products) with their d = 762 features were clustered using Opossum. The extended price was used for the feature entries, representing purchased quantity weighted according to price.
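The filtering steps described above can be sketched as follows. The input layout (a dictionary mapping each customer to a dictionary of product and extended price), the function name prepare_baskets, and the toy data are assumptions made for illustration; only the three filtering rules and the $25 threshold come from the text.

```python
def prepare_baskets(baskets, too_common=("Financial-Depts",), min_revenue=25.0):
    """Build a customer-by-product matrix of extended prices.

    baskets: {customer: {product: extended price in dollars}}
    Returns (customers, products, rows), where rows[i][j] is the amount
    customer i spent on product j.
    """
    # Total revenue per product across all customers.
    revenue = {}
    for items in baskets.values():
        for product, amount in items.items():
            revenue[product] = revenue.get(product, 0.0) + amount

    # Drop items flagged as too common and items with too little revenue.
    products = sorted(p for p, total in revenue.items()
                      if p not in too_common and total >= min_revenue)
    column = {p: j for j, p in enumerate(products)}

    customers, rows = [], []
    for customer, items in baskets.items():
        row = [0.0] * len(products)
        for product, amount in items.items():
            if product in column:
                row[column[product]] += amount
        # Drop customers whose basket is empty after product removal.
        if any(row):
            customers.append(customer)
            rows.append(row)
    return customers, products, rows


# Toy data (made up for illustration).
baskets = {
    "c1": {"Financial-Depts": 10.0, "Vitamins": 30.0},
    "c2": {"Vitamins": 40.0, "Gum": 5.0},
    "c3": {"Financial-Depts": 20.0},
    "c4": {"Gum": 3.0},
}
customers, products, rows = prepare_baskets(baskets)
```

In the toy data, Financial-Depts is removed as too common and Gum falls below the revenue threshold, which leaves customers c3 and c4 with empty baskets, so they are dropped as well.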

In this customer clustering case study we set k = 20. In this application domain, the number of clusters is often predetermined by marketing considerations such as advertising industry standards, marketing budgets, marketers' ability to handle multiple groups, and the cost of personalization. In general, a reasonable value of k can be obtained using heuristics (Section 3.3.3).

5 provided by Knowledge Discovery 1.
