Note that the majority category, purity, and entropy are available only
where a supervised categorization is given. Of course, the categorization
cannot be used to tune the clustering. Clusters 1 and 2 contain only documents
from the Health category, so they are highly related. The fourth cluster,
which Clusion indicates to be weak, in fact has the lowest purity in
the group, with 38% of documents from the most dominant category (film).
Clusion also suggests that cluster 3 is not only strong, as indicated by the
dark diagonal region, but that it also has distinctly above-average relationships
to the other four clusters. On inspecting the word stems typifying this
cluster (Apple, Intel, and electron(ics)), it is apparent that this is because
of the interdisciplinary appearance of technology-savvy words in recent news
releases. Because such cluster descriptions might not be so easily available
or well understood in all domains, the intuitive display of Clusion is very
useful.
Clusion has several other powerful properties. For example, it can be
integrated with product hierarchies (metadata) to provide simultaneous customer
and product clustering, as well as multilevel views and summaries. It
also has a graphical user interface so one can interactively browse, split, or
merge a data set, which is of great help to speed up the iterations of analysis
during a data-mining project.
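
The per-cluster majority category, purity, and entropy mentioned above can be illustrated with a minimal Python sketch. It assumes each cluster is given as a plain list of supervised category labels; the function name cluster_quality and the toy labels are hypothetical and not taken from the chapter. Entropy here is the unnormalized Shannon entropy in bits, which may differ from the chapter's own definition by a normalization factor.

```python
from collections import Counter
from math import log2


def cluster_quality(categories):
    """Return (majority category, purity, entropy in bits) for one cluster.

    `categories` holds the supervised category label of every document
    assigned to the cluster (a hypothetical input format).
    """
    if not categories:
        raise ValueError("cluster is empty")
    counts = Counter(categories)
    total = len(categories)
    majority, majority_count = counts.most_common(1)[0]
    # Purity: fraction of documents that belong to the dominant category.
    purity = majority_count / total
    # Shannon entropy of the category mix inside the cluster:
    # 0 for a single-category cluster, larger for more mixed clusters.
    entropy = sum((c / total) * log2(total / c) for c in counts.values())
    return majority, purity, entropy


# Toy labels, made up for illustration only.
pure = cluster_quality(["Health"] * 5)
mixed = cluster_quality(["film", "film", "music", "art"])
print(pure)
print(mixed)
```

These measures only evaluate a clustering after the fact; consistent with the text, the category labels play no part in forming the clusters.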
3.5 Experiments
3.5.1 Retail Market Basket Clusters
Let us illustrate clustering in a real retail transaction database of 21,672
customers of a drugstore. 5 For the illustrative purpose of this chapter,
we randomly selected 2500 customers. The total number of transactions
(cash-register scans) for these customers is 33,814 over three months. We
rolled up the product hierarchy once to obtain 1236 different products purchased.
Fifteen percent of the total revenue is contributed by the single item
Financial-Depts (on-site financial services such as check cashing and bill
payment), which was removed because it was too common. Of the remaining 1235
products, 473 accounted for less than $25 each in total and were dropped. The
remaining n = 2466 customers (34 customers had empty baskets after removing
the irrelevant products) with their d = 762 features were clustered using
Opossum. The extended price was used as the feature entries to represent
purchased quantity weighted according to price.
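
The preprocessing steps described above (removing the overly common item, dropping products with less than $25 in total revenue, and discarding customers left with empty baskets) can be sketched as follows. This is only an illustration under assumptions: the input format of (customer, product, extended price) tuples, the function name build_customer_features, and the toy data are all hypothetical, and the sketch is not the chapter's actual pipeline.

```python
from collections import defaultdict


def build_customer_features(line_items, excluded_products=(),
                            min_product_revenue=25.0):
    """Build sparse customer feature rows from purchase line items.

    `line_items` is an iterable of (customer_id, product_id, extended_price)
    tuples (a hypothetical input format). Returns the sorted list of kept
    products and a dict mapping each customer to {product: extended price}.
    """
    baskets = defaultdict(lambda: defaultdict(float))
    product_revenue = defaultdict(float)
    for customer, product, extended_price in line_items:
        if product in excluded_products:
            continue  # e.g. an item judged too common to be informative
        baskets[customer][product] += extended_price
        product_revenue[product] += extended_price

    # Keep only products whose total revenue reaches the threshold.
    kept = sorted(p for p, r in product_revenue.items()
                  if r >= min_product_revenue)
    kept_set = set(kept)

    features = {}
    for customer, basket in baskets.items():
        row = {p: v for p, v in basket.items() if p in kept_set}
        if row:  # customers with empty baskets are dropped
            features[customer] = row
    return kept, features


# Toy data, made up for illustration only.
toy = [
    ("c1", "vitamins", 30.0),
    ("c1", "gum", 2.0),
    ("c2", "gum", 3.0),
    ("c2", "Financial-Depts", 100.0),
    ("c3", "vitamins", 12.5),
]
products, features = build_customer_features(
    toy, excluded_products={"Financial-Depts"})
print(products)
print(features)
```

In the toy data, customer c2 ends up with an empty basket once the excluded and low-revenue products are removed, mirroring the 34 customers dropped in the case study.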
In this customer clustering case study we set k = 20. In this application
domain, the number of clusters is often predetermined by marketing considerations
such as advertising industry standards, marketing budgets, marketers'
ability to handle multiple groups, and the cost of personalization. In general,
a reasonable value of k can be obtained using heuristics (Section 3.3.3).
5 provided by Knowledge Discovery 1.