Moreover, it allows us to use compression for data mining tasks, such as computing data dissimilarity. Note that this property generally holds for any generic compressor, and therefore compression algorithms like those in ZIP, GZIP, BZIP, etc., can also be used for such tasks. As a concrete example, the family of Normalized Compression Distance measures [41] relies on this.
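To make this concrete, here is a minimal sketch of the Normalized Compression Distance computed with an off-the-shelf compressor; the choice of zlib and the toy inputs are illustrative assumptions, not part of the original formulation.

```python
import zlib

def compressed_size(x: bytes) -> int:
    """Size in bytes of x under a generic, off-the-shelf compressor (zlib)."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: close to 0 for similar inputs,
    close to 1 for dissimilar ones (up to compressor imperfections)."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy usage: a near-duplicate pair should score lower than an unrelated pair.
print(ncd(b"the quick brown fox" * 10, b"the quick brown fox!" * 10))
print(ncd(b"the quick brown fox" * 10, b"lorem ipsum dolor sit amet" * 10))
```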
As a slight variant, we can also fix the model M and see how well it compresses another dataset. That is, we require that L(D | M) is explicitly calculable for any dataset D over the specified data universe T and any model M ∈ M. This allows us to calculate how well a dataset matches the distribution of another dataset. See also Sect. 5.
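As a rough illustration of this idea with a generic compressor (only a sketch; the preset-dictionary trick and the toy datasets below are assumptions for illustration, not the models discussed in this chapter), we can build a compression dictionary from one dataset, keep it fixed, and measure how many bits another dataset takes under it.

```python
import zlib

def bits_under_fixed_model(model_data: bytes, data: bytes) -> int:
    """Crude stand-in for L(D | M): compress `data` with zlib primed with a
    fixed preset dictionary built from `model_data` (zlib only uses the last
    32 KiB of the dictionary)."""
    compressor = zlib.compressobj(level=9, zdict=model_data[-32768:])
    return 8 * len(compressor.compress(data) + compressor.flush())

d_a = b"red,apple;green,pear;red,cherry;" * 50    # data the "model" is built from
d_b = b"green,pear;red,apple;red,cherry;" * 5     # similar distribution
d_c = b"blue,whale;grey,elephant;pink,boar;" * 5  # different distribution

# D_B should need fewer bits under the model of D_A than D_C does.
print(bits_under_fixed_model(d_a, d_b))
print(bits_under_fixed_model(d_a, d_c))
```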
Tuple-level Compression In addition, a rather useful property is when each tuple t ∈ D can be compressed individually, independent of all other tuples. That is, L(t | M) can be computed for a given M. This also implies we can simply calculate L(D | M) as

L(D | M) = Σ_{t ∈ D} L(t | M).
This property simplifies many aspects related to the induction and usage of the
models. For example, as a consequence, calculating the encoded size can now be
trivially parallelized. More important, though, is that common data mining tasks
such as classification and clustering are now straightforward. More on this later.
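As an illustration of how tuple-level code lengths make classification straightforward, here is a minimal sketch. The stand-in model (independent per-attribute counts with Laplace smoothing) and the toy data are assumptions for illustration only, not the pattern-based models of this chapter.

```python
import math
from collections import Counter

def induce(tuples):
    """Stand-in model M: independent per-attribute value counts
    (illustration only; the models in this chapter are pattern-based)."""
    counts = [Counter(t[i] for t in tuples) for i in range(len(tuples[0]))]
    return counts, len(tuples)

def L_tuple(t, model):
    """L(t | M) in bits: -log2 of the (Laplace-smoothed) probability of
    each attribute value of t under the model."""
    counts, n = model
    return sum(-math.log2((counts[i][v] + 1) / (n + len(counts[i]) + 1))
               for i, v in enumerate(t))

def L_data(D, model):
    """L(D | M) = sum over all tuples t in D of L(t | M);
    each term is independent, so this is trivially parallelizable."""
    return sum(L_tuple(t, model) for t in D)

def classify(t, class_models):
    """MDL-style classification: assign t to the class whose model
    compresses it best, i.e., gives the shortest code length."""
    return min(class_models, key=lambda c: L_tuple(t, class_models[c]))

# Toy usage: one model per class, induced from labelled tuples.
labelled = {
    "fruit":     [("red", "round"), ("green", "round"), ("red", "round")],
    "vegetable": [("green", "long"), ("green", "long"), ("white", "long")],
}
class_models = {c: induce(ts) for c, ts in labelled.items()}
print(classify(("red", "round"), class_models))   # -> fruit
print(classify(("green", "long"), class_models))  # -> vegetable
```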
Pattern-level Inspection The third and final property that we discuss here is that of sub-tuple, or pattern-level, inspection. That is, beyond computing L(t | M), we also want to be able to inspect how a given tuple is encoded: what structure, in the form of a set of patterns, is used to compress it?
With this property, it becomes possible to provide explanations for certain outcomes (e.g., explain why a certain tuple is compressed better by one model than by another), but also to exploit this information to improve the model (e.g., patterns that often occur together in a tuple should probably be combined). Effectively, it is this property that makes pattern-based solutions so powerful, as it ensures that in addition to decisions, we can offer explanations.
3.2 Code Tables
The conceptually simplest, as well as the most commonly used, pattern-based model for MDL is the so-called code table (see, e.g., [23, 56, 57, 59, 64, 68]). Informally, a code table is a dictionary, a translation table between patterns and codes. Each entry in the left column contains a pattern and corresponds to exactly one code word in the right column. Such a code table can be used to compress the data by replacing occurrences of patterns with their corresponding codes, and vice versa to decode an encoded dataset and reconstruct the original data. Using the MDL principle, the problem can then be formulated as finding the code table that gives the best compression.
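To give a feel for this, below is a toy code table and a simple greedy cover function; the items, patterns, code lengths, and cover order are illustrative assumptions (in practice, code lengths follow from how often each pattern is used in the cover of the data, and the exact cover strategy depends on the method).

```python
# A toy code table over market-basket items: each pattern (an itemset) is
# paired with a code length in bits. The numbers are made up; in practice
# code lengths follow from how often each pattern is used in the cover.
code_table = [
    (frozenset({"bread", "butter", "jam"}), 2.0),
    (frozenset({"bread", "butter"}),        1.5),
    (frozenset({"bread"}),                  3.0),
    (frozenset({"butter"}),                 3.5),
    (frozenset({"jam"}),                    4.0),
    (frozenset({"milk"}),                   4.0),
]

def cover(t):
    """Greedily cover tuple t with patterns in code table order (more
    specific patterns first). Returns the patterns used and the total
    encoded length L(t | CT) in bits."""
    uncovered, used, bits = set(t), [], 0.0
    for pattern, code_len in code_table:
        if pattern <= uncovered:          # pattern still fully uncovered
            used.append(pattern)
            bits += code_len
            uncovered -= pattern
    assert not uncovered, "a code table should contain all single items"
    return used, bits

patterns, length = cover({"bread", "butter", "milk"})
print(patterns)  # which patterns encode the tuple (pattern-level inspection)
print(length)    # L(t | CT) = 1.5 + 4.0 bits
```

The list of patterns returned by the cover is exactly the pattern-level inspection discussed above: it tells us which structure was used to encode the tuple.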