dimensionality of the feature space. In such cases, the objects can be well represented in a much lower-dimensional similarity space for further analysis, and several recent works on classification have fruitfully exploited this idea [3.10].
A similarity measure captures the relationship between two d-dimensional objects in a single number (using on the order of the number of nonzeros, or d at worst, computations). Once this is done, the original high-dimensional space is not dealt with at all; we only work in the transformed similarity space, and subsequent processing is independent of d.
A similarity measure s(x_a, x_b) ∈ [0, 1] captures how related two data points x_a and x_b are. It should be symmetric (s(x_a, x_b) = s(x_b, x_a)), with self-similarity s(x_a, x_a) = 1. However, in general, similarity functions (respectively their distance function equivalents δ = −log(s)) do not obey the triangle inequality.
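The failure of the triangle inequality can be seen concretely: for the Gaussian-style similarity s = e^(−‖x_a − x_b‖²), the induced distance δ = −log(s) is the squared Euclidean distance, which is not metric. A minimal sketch in Python (the function name is ours, chosen for illustration):

```python
# delta = -log(s) for s = exp(-||xa - xb||^2) is the *squared*
# Euclidean distance, which violates the triangle inequality.
def sq_dist(xa, xb):
    return sum((a - b) ** 2 for a, b in zip(xa, xb))

p, q, r = [0.0], [1.0], [2.0]
print(sq_dist(p, r))                  # 4.0
print(sq_dist(p, q) + sq_dist(q, r))  # 2.0 -> 4.0 > 2.0, triangle inequality fails
```

Going through the midpoint q is "shorter" than the direct route from p to r, so δ cannot be a metric even though s itself is a perfectly usable similarity.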
The suitability of a similarity measure depends on the nature (generative
model) of the data. A precise notion of a similarity between two points in
terms of the Fisher score is provided by information geometry arguments
[3.12], but this of course needs an appropriate parametric generative model
of the data. Let us now consider some popular similarity measures.
An obvious way to compute similarity is through a suitable monotonic and inverse function of a Minkowski (L_p) distance, δ. Candidates include s = 1/(1 + δ) and s = e^(−δ²). Because maximizing average similarity (likelihood) is equivalent to minimizing the average squared distance under a Gaussian probability model, the latter formulation is preferable [3.13].
Similarity can also be defined by the cosine of the angle between two
vectors:
    s^(C)(x_a, x_b) = (x_a^T x_b) / (‖x_a‖_2 · ‖x_b‖_2).        (3.1)
Cosine similarity is widely used in text clustering because two documents with the same proportions of term occurrences but different lengths are often considered identical. Its normalization guards against domination by longer vectors.
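The length invariance is easy to verify: scaling a term-count vector leaves Eq. (3.1) unchanged. A minimal sketch in Python (the function name is ours):

```python
import math

def cosine_similarity(xa, xb):
    """s^(C) = (xa . xb) / (||xa||_2 * ||xb||_2); assumes nonzero vectors."""
    dot = sum(a * b for a, b in zip(xa, xb))
    na = math.sqrt(sum(a * a for a in xa))
    nb = math.sqrt(sum(b * b for b in xb))
    return dot / (na * nb)

# Two "documents" with identical term proportions but different lengths:
doc_a = [2, 1, 0, 3]
doc_b = [4, 2, 0, 6]   # doc_a scaled by 2
print(cosine_similarity(doc_a, doc_b))  # 1.0 (up to floating-point error)
```

Because only the angle between the vectors matters, the short and long versions of the same document are judged identical, which is exactly the behavior wanted in text clustering.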
In retail data, such normalization loses important information about the lifetime customer value, and we have recently shown that the extended Jaccard similarity measure is more appropriate [3.13]. For binary features, the Jaccard coefficient [3.14] measures the ratio of the intersection of the product sets to the union of the product sets corresponding to transactions x_a and x_b, each having binary (0/1) elements:
    s^(J)(x_a, x_b) = (x_a^T x_b) / (‖x_a‖_2² + ‖x_b‖_2² − x_a^T x_b).        (3.2)
Note that dimensions that have “0” entries for both vectors have no effect
on the measure. This is very helpful because transactional data are highly
sparse, so many zeros are encountered.
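For 0/1 vectors, Eq. (3.2) reduces to |intersection| / |union| of the two product sets, and shared-zero dimensions drop out of every term. A minimal sketch in Python (the function name and example baskets are ours):

```python
def extended_jaccard(xa, xb):
    """s^(J) = (xa . xb) / (||xa||^2 + ||xb||^2 - xa . xb), as in Eq. (3.2)."""
    dot = sum(a * b for a, b in zip(xa, xb))
    na2 = sum(a * a for a in xa)
    nb2 = sum(b * b for b in xb)
    return dot / (na2 + nb2 - dot)

# Binary market-basket vectors: 2 shared items out of 4 distinct items.
xa = [1, 1, 0, 1, 0, 0]
xb = [0, 1, 0, 1, 1, 0]
print(extended_jaccard(xa, xb))              # 0.5
print(extended_jaccard(xa + [0], xb + [0]))  # still 0.5: shared zeros are ignored
```

Unlike cosine similarity, the extended Jaccard measure is not scale invariant, so it retains the magnitude information (e.g., purchase quantities) that matters for lifetime customer value.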