dimensionality of the feature space. In such cases, the objects can be well represented in a much lower-dimensional similarity space for further analysis, and several recent works on classification have fruitfully exploited this idea [3.10].
A similarity measure captures the relationship between two d-dimensional objects in a single number (using on the order of the number of nonzeros, or d at worst, computations). Once this is done, the original high-dimensional space is not dealt with at all; we only work in the transformed similarity space, and subsequent processing is independent of d.
A similarity measure s(x_a, x_b) ∈ [0, 1] captures how related two data points x_a and x_b are. It should be symmetric (s(x_a, x_b) = s(x_b, x_a)), with self-similarity s(x_a, x_a) = 1. However, in general, similarity functions (respectively their distance function equivalents δ = −log(s)) do not obey the triangle inequality.
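The failure of the triangle inequality can be seen concretely: for the Gaussian-style similarity s = e^(−‖x_a − x_b‖²), the induced distance δ = −log(s) is the squared Euclidean distance, which is not metric. A minimal sketch in Python (the function name is ours, chosen for illustration):

```python
# delta = -log(s) for s = exp(-||xa - xb||^2) is the *squared*
# Euclidean distance, which violates the triangle inequality.
def sq_dist(xa, xb):
    return sum((a - b) ** 2 for a, b in zip(xa, xb))

p, q, r = [0.0], [1.0], [2.0]
print(sq_dist(p, r))                  # 4.0
print(sq_dist(p, q) + sq_dist(q, r))  # 2.0 -> 4.0 > 2.0, triangle inequality fails
```

Going through the midpoint q is "shorter" than the direct route from p to r, so δ cannot be a metric even though s itself is a perfectly usable similarity.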
The suitability of a similarity measure depends on the nature (generative
model) of the data. A precise notion of a similarity between two points in
terms of the Fisher score is provided by information geometry arguments
[3.12], but this of course needs an appropriate parametric generative model
of the data. Let us now consider some popular similarity measures.
An obvious way to compute similarity is through a suitable monotonic and inverse function of a Minkowski (L_p) distance, δ. Candidates include s = 1/(1 + δ) and s = e^(−δ²). Because maximizing average similarity (likelihood) is equivalent to minimizing the average squared distance under a Gaussian probability model, the latter formulation is preferable [3.13].
Similarity can also be defined by the cosine of the angle between two
vectors:
    s^(C)(x_a, x_b) = (x_a^T x_b) / (‖x_a‖_2 · ‖x_b‖_2).        (3.1)
Cosine similarity is widely used in text clustering because two documents with the same proportions of term occurrences but different lengths are often considered identical. Its normalization guards against domination by longer vectors.
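The length invariance is easy to verify: scaling a term-count vector leaves Eq. (3.1) unchanged. A minimal sketch in Python (the function name is ours):

```python
import math

def cosine_similarity(xa, xb):
    """s^(C) = (xa . xb) / (||xa||_2 * ||xb||_2); assumes nonzero vectors."""
    dot = sum(a * b for a, b in zip(xa, xb))
    na = math.sqrt(sum(a * a for a in xa))
    nb = math.sqrt(sum(b * b for b in xb))
    return dot / (na * nb)

# Two "documents" with identical term proportions but different lengths:
doc_a = [2, 1, 0, 3]
doc_b = [4, 2, 0, 6]   # doc_a scaled by 2
print(cosine_similarity(doc_a, doc_b))  # 1.0 (up to floating-point error)
```

Because only the angle between the vectors matters, the short and long versions of the same document are judged identical, which is exactly the behavior wanted in text clustering.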
In retail data, such normalization loses important information about the lifetime customer value, and we have recently shown that the extended Jaccard similarity measure is more appropriate [3.13]. For binary features, the Jaccard coefficient [3.14] measures the ratio of the intersection of the product sets to the union of the product sets corresponding to transactions x_a and x_b, each having binary (0/1) elements:
    s^(J)(x_a, x_b) = (x_a^T x_b) / (‖x_a‖_2² + ‖x_b‖_2² − x_a^T x_b).        (3.2)
Note that dimensions that have “0” entries for both vectors have no effect
on the measure. This is very helpful because transactional data are highly
sparse, so many zeros are encountered.
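For 0/1 vectors, Eq. (3.2) reduces to |intersection| / |union| of the two product sets, and shared-zero dimensions drop out of every term. A minimal sketch in Python (the function name and example baskets are ours):

```python
def extended_jaccard(xa, xb):
    """s^(J) = (xa . xb) / (||xa||^2 + ||xb||^2 - xa . xb), as in Eq. (3.2)."""
    dot = sum(a * b for a, b in zip(xa, xb))
    na2 = sum(a * a for a in xa)
    nb2 = sum(b * b for b in xb)
    return dot / (na2 + nb2 - dot)

# Binary market-basket vectors: 2 shared items out of 4 distinct items.
xa = [1, 1, 0, 1, 0, 0]
xb = [0, 1, 0, 1, 1, 0]
print(extended_jaccard(xa, xb))              # 0.5
print(extended_jaccard(xa + [0], xb + [0]))  # still 0.5: shared zeros are ignored
```

Unlike cosine similarity, the extended Jaccard measure is not scale invariant, so it retains the magnitude information (e.g., purchase quantities) that matters for lifetime customer value.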