The extended Jaccard coefficient is also given by Eq. (3.2), but it allows the elements of x_a and x_b to be arbitrary positive real numbers. This coefficient captures a vector-length-sensitive measure of similarity. However, it is still invariant to scale (dilating x_a and x_b by the same factor does not change s(x_a, x_b)). A detailed discussion of the properties of various similarity measures can be found in [3.13], where it is shown that the extended Jaccard coefficient is particularly well-suited for market basket data.
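For concreteness, here is a minimal sketch of the extended Jaccard computation; it assumes Eq. (3.2), which is not reproduced in this excerpt, has the usual Tanimoto form s(x_a, x_b) = x_a·x_b / (||x_a||^2 + ||x_b||^2 − x_a·x_b), and it illustrates the scale invariance noted above.

import numpy as np

def extended_jaccard(xa, xb):
    # Extended Jaccard (Tanimoto) similarity for nonnegative real-valued vectors:
    # s = xa.xb / (||xa||^2 + ||xb||^2 - xa.xb)   (assumed form of Eq. (3.2))
    dot = float(np.dot(xa, xb))
    denom = float(np.dot(xa, xa) + np.dot(xb, xb) - dot)
    return dot / denom if denom > 0 else 0.0

xa = np.array([2.0, 0.0, 1.0])   # toy "extended price" vectors
xb = np.array([1.0, 3.0, 0.0])
print(extended_jaccard(xa, xb))          # 2 / 13, about 0.154
print(extended_jaccard(5 * xa, 5 * xb))  # identical value: a common dilation cancels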
Because the "curse of dimensionality" cannot be avoided for general data distributions, no similarity measure is optimal for all applications. Rather, one must determine an appropriate measure for the given application, one that captures the essential aspects of the class of high-dimensional data distributions being considered.
3.3 OPOSSUM
In this section, we present Opossum (Optimal Partitioning of Sparse Similarities Using Metis), a similarity-based clustering technique particularly tailored to market basket data. Opossum differs from other graph-based clustering techniques by application-driven balancing of clusters, nonmetric similarity measures, and visualization-driven heuristics for finding an appropriate k.
3.3.1 Balancing
Typically, one segments transactional data into five to twenty groups, each of which should be of comparable importance. Balancing avoids trivial clusterings (e.g., k − 1 singletons and one big cluster). More importantly, the desired balancing properties have many application-driven advantages. For example, when each cluster contains the same number of customers, discovered phenomena (e.g., frequent products, co-purchases) have equal significance or support and are thus easier to evaluate. When each customer cluster accounts for the same revenue share, marketing can spend an equal amount of attention and budget on each of the groups.
Opossum strives to deliver "balanced" clusters using one of the following two criteria:
- Sample Balanced: Each cluster should contain roughly the same number
of samples, n/k. This allows, for example, retail marketers to obtain a
customer segmentation with comparably sized customer groups.
- Value Balanced: Each cluster should contain roughly the same amount of feature value. Thus, a cluster represents a kth fraction of the total feature value v = Σ_{j=1}^{n} Σ_{i=1}^{d} x_{i,j}, i.e., the sum over all samples and features. In customer clustering, we use extended price per product as features, and thus each cluster represents a roughly equal contribution to total revenue (see the sketch after this list).
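As a small illustration of the two criteria, the sketch below reports how well a given assignment of customers to clusters meets the sample-balance target n/k and the value-balance target v/k. The matrix X, the cluster labels, and the helper name balance_report are hypothetical toy choices; Opossum itself obtains the partition via Metis-based graph partitioning, which is not reimplemented here.

import numpy as np

def balance_report(X, labels, k):
    # X: n x d matrix of extended prices (customers x products); toy data below.
    # labels: cluster index in {0, ..., k-1} for each of the n customers.
    n = X.shape[0]
    v = X.sum()                                   # total feature value (revenue)
    sizes = np.bincount(labels, minlength=k)      # samples per cluster
    values = np.array([X[labels == c].sum() for c in range(k)])  # value per cluster
    print("sample-balance target n/k:", n / k, " actual:", sizes)
    print("value-balance target v/k :", v / k, " actual:", values)

X = np.array([[10.0, 0.0, 5.0],
              [ 0.0, 2.0, 1.0],
              [ 4.0, 4.0, 0.0],
              [ 1.0, 0.0, 9.0],
              [ 0.0, 6.0, 2.0],
              [ 3.0, 1.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 1])             # hypothetical 2-way clustering
balance_report(X, labels, k=2)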