database, L(D_1 | M_1), and arrive at

$$\mathrm{ACLD}(D_1, M_2) = \frac{L(D_1 \mid M_2) - L(D_1 \mid M_1)}{L(D_1 \mid M_1)}.$$
Like Kullback-Leibler divergence, ACLD is asymmetric: it measures how different D_2 is from D_1, not vice versa. While it is reasonable to expect these two quantities to be in the same ballpark, this is not a given.
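As a minimal sketch (not from the chapter), the aggregated code length difference can be computed directly from the encoded sizes. The function name and all numeric values below are illustrative assumptions; in practice L(D | M) would come from a compressor such as KRIMP.

```python
def acld(l_d1_m2: float, l_d1_m1: float) -> float:
    """Aggregated code length difference ACLD(D1, M2):
    how many more bits M2 (the model of D2) needs to encode D1,
    relative to D1's own MDL-optimal model M1."""
    return (l_d1_m2 - l_d1_m1) / l_d1_m1

# Placeholder sizes in bits: M2 needs 1300 bits where M1 needs 1000,
# so D1 is 30% more expensive to encode under D2's model.
print(acld(1300.0, 1000.0))  # → 0.3
```

Note the asymmetry: swapping which database's model is used generally changes the score, which is exactly why a symmetric variant is introduced next.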
5.2.2 The Database Dissimilarity Measure
The asymmetric measure allows us to measure the similarity of one database to another. To make it a practical measure, we would like it to be symmetric. We achieve this by taking the maximum of the two aggregated differences, i.e., max{ACLD(D_1, M_2), ACLD(D_2, M_1)}. This can easily be rewritten in terms of compressed database sizes, as follows.
Definition 8.15 Let D_1 and D_2 be two databases drawn from T, and let M_1 and M_2 be their corresponding MDL-optimal models. Then, the dissimilarity measure DS between D_1 and D_2 is defined as

$$\mathrm{DS}(D_1, D_2) = \max\left\{\frac{L(D_1 \mid M_2) - L(D_1 \mid M_1)}{L(D_1 \mid M_1)},\ \frac{L(D_2 \mid M_1) - L(D_2 \mid M_2)}{L(D_2 \mid M_2)}\right\}.$$
Using this measure, we obtain a score of 0 iff the databases are identical, and higher scores indicate higher dissimilarity. In theory, using MDL-optimal models we find that DS, like NCD [11], is a metric: the symmetry axiom holds by definition, scores cannot be negative, and it holds that DS(D_1, D_2) = 0 iff D_1 = D_2. The advantage of DS over NCD is that we only have to induce two models, as opposed to four. For heuristic model induction algorithms the metric property is difficult to prove. However, instantiating this measure for itemset data using KRIMP, we obtain very good results [66]: dataset pairs drawn from the same distribution have very low dissimilarities, whereas dataset pairs from different distributions have substantially larger dissimilarities.
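The definition above reduces to a few arithmetic operations once the four encoded sizes are known. The following sketch assumes those sizes are given (the function name and the placeholder bit counts are illustrative, not from the chapter):

```python
def dissimilarity(l11: float, l12: float, l21: float, l22: float) -> float:
    """DS(D1, D2): the maximum of the two normalized aggregated
    code length differences. Here l_ij = L(D_i | M_j), the number
    of bits needed to encode database D_i with model M_j."""
    return max((l12 - l11) / l11, (l21 - l22) / l22)

# Identical databases: each model compresses the other database
# exactly as well as its own, so the dissimilarity is 0.
print(dissimilarity(1000.0, 1000.0, 1000.0, 1000.0))  # → 0.0

# Different databases: cross-compression costs extra bits, and DS
# keeps the larger of the two relative overheads (here 0.4 vs ~0.389).
print(dissimilarity(1000.0, 1400.0, 1250.0, 900.0))  # → 0.4
```

Symmetry holds by construction (the max runs over both directions), and only the two models M_1 and M_2 need to be induced, matching the comparison with NCD above.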
5.3 Identifying and Characterizing Components
Though most databases are mixtures drawn from different distributions, we often assume only one distribution. Clearly, this leads to suboptimal results: the distributions need to be modeled individually.
Clustering addresses part of this problem by trying to separate the source components that make up the mixture. However, as we do not know upfront what distinguishes the different components, the appropriate distance metric is hard to define. Furthermore, in clustering we are only returned the object assignment, and