Naïve Bayes and SVMs. This may be considered an unexpectedly positive result, as the sole goal of each code table is to characterize and describe an individual class-based database. The fact that these code tables can in practice also be used to distinguish samples drawn from the different distributions means that they indeed capture these distributions very well.
5.2 A Dissimilarity Measure for Datasets
Comparing datasets to find and explain differences is a frequent task in many organizations. The two databases can, e.g., originate from different branches of the same organization, such as sales records from different stores of a chain, or they can be snapshots of the "same" database at different points in time. A first step towards identifying differences between datasets is to quantify how different two datasets are.
Although this may appear to be a simple task at first sight, in practice it turns out to be far from trivial in many cases. In particular, this is true for data types for which no obvious distance measures are available, such as categorical data. In this subsection we describe a compression-based difference measure for datasets (based on [66]).
5.2.1 Code Length Differences
Let $D_1$ and $D_2$ be two databases with tuples drawn from the same data universe $\mathcal{T}$.
The MDL principle implies that the optimal compressor induced from a database $D_1$ will generally provide shorter encodings for its tuples than the optimal compressor induced from another database $D_2$. This is the same principle as used by the classifier described in the previous subsection, and again we assume and exploit the tuple-level compression property.
Formally, let $M_i$ be the optimal model induced from database $D_i$, and $t$ a transaction in $D_1$. Then, the MDL principle implies that $|L(t \mid M_2) - L(t \mid M_1)|$
• is small if $t$ is equally likely under the distributions underlying $D_1$ and $D_2$;
• is large if $t$ is more likely under the distribution underlying one database than under the distribution underlying the other.
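To make these quantities concrete, the following Python sketch computes such code length differences. It is a deliberate simplification, not the actual Krimp implementation: we assume a code table is given as a dict mapping itemsets (frozensets) to their usage counts, iterated in the cover order used by the compressor, and we derive Shannon-optimal code lengths from those usages.

import math

# A minimal sketch, not the actual Krimp implementation: a code table
# is assumed to be a dict mapping itemsets (frozensets) to usage counts,
# iterated in the cover order used by the compressor.

def code_lengths(code_table):
    """Shannon-optimal code lengths from usages: L(X) = -log2(usage(X) / total)."""
    total = sum(code_table.values())
    return {x: -math.log2(u / total) for x, u in code_table.items() if u > 0}

def encoded_length(t, code_table):
    """L(t | M): summed lengths of the codes that cover tuple t,
    using a greedy cover in the table's iteration order."""
    remaining = set(t)
    bits = 0.0
    for itemset, length in code_lengths(code_table).items():
        if itemset <= remaining:
            bits += length
            remaining -= itemset
    if remaining:
        raise ValueError(f"tuple not fully covered: {remaining}")
    return bits

def length_difference(t, ct1, ct2):
    """|L(t | M2) - L(t | M1)|: small if t is equally likely under both
    underlying distributions, large otherwise."""
    return abs(encoded_length(t, ct2) - encoded_length(t, ct1))

# Example with two toy code tables over items {a, b, c}:
ct1 = {frozenset("ab"): 5, frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 1}
ct2 = {frozenset("bc"): 5, frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 1}
print(length_difference(frozenset("ab"), ct1, ct2))  # large: {a,b} is cheap under ct1 only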
Furthermore, the MDL principle implies that for the MDL-optimal models $M_1$ and $M_2$ and $t$ from $D_1$, the expected average value of $L(t \mid M_2) - L(t \mid M_1)$ is positive. The next step towards a dissimilarity measure is to aggregate these code length differences over the dataset.
If we were to do this naively, the resulting aggregate would depend on the size of the data. To avoid this, we normalize by dividing by the 'native' encoded size of the database.
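One way such a normalized aggregate could look is sketched below, continuing the Python example above; the precise measure is defined in [66], so the formulation here, like the function names, is an assumption. We divide the aggregated code length difference for $D_1$ by its native encoded size $L(D_1 \mid M_1)$, and symmetrize by taking the maximum over both directions.

# Assumed formulation, not taken verbatim from [66]; reuses
# encoded_length() from the sketch above.

def aggregated_difference(db, ct_own, ct_other):
    """Normalized aggregate: (L(D | M_other) - L(D | M_own)) / L(D | M_own)."""
    own = sum(encoded_length(t, ct_own) for t in db)
    other = sum(encoded_length(t, ct_other) for t in db)
    return (other - own) / own

def dissimilarity(db1, ct1, db2, ct2):
    """Symmetric dataset dissimilarity: close to zero when both databases
    stem from the same distribution, larger the more they differ."""
    return max(aggregated_difference(db1, ct1, ct2),
               aggregated_difference(db2, ct2, ct1))

Dividing by the native encoded size makes the measure independent of the number of tuples, so databases of different sizes can be compared on an equal footing.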