Database Reference
In-Depth Information
.A
pattern language is the set of all possible patterns that we can discover for a given
data type. In principle, a pattern can be any structure that describes the distribution
of (a subset of) the data. Given the topic of the topic, we focus on frequent patterns;
e.g., when we consider itemsets,
Given a dataset, on of the most important choices is the pattern language
X
X
can be the set of all possible itemsets, while for
structured data,
can consist of sequential patterns, subgraphs, etc.
Clearly, the choice of
X
is highly important, as it determines the type of structure
that we will be able to discover in the data. Another way of thinking about
X
is that
it defines the 'vocabulary' of the compressor. If one chooses a pattern language that
is highly specific, it may be impossible to find relevant structure of that type in the
data. On the other hand, if a very rich, i.e., more complex pattern language is chosen,
the encoding and search for the model can become rather complicated.
X
3.1
Pattern Models for MDL
Given a class of data and a pattern language, we can start to construct a pattern-based
model. Note that by defining a model class, we essentially fix the set of possible
models
M
D
. Given this space of
possible descriptions, we can employ the MDL principle to select the best model
M
, the possible descriptions, for a given dataset
simply by choosing the model that minimizes the total compressed
size. In order to do so, however, we need to be able to compute L (
M
for
D
D
, M ), however,
the encoded length of the model and the data given the model.
We start with the latter, i.e., we first formally define how to compute L (
D | M ),
the encoded length in bits of the data given the model. Generally speaking, there
are many different ways to describe the same data using one model. However, by
the MDL principle, our encoding should be such that we use the minimal amount of
bits to do so. This helps us to make principled choices when defining the encoding
scheme. Some of these may impose additional constraints and requirements on the
design of the compressor, as well as determine how the score can be used. This is
particularly important in light of subsequently using the pattern-based models in data
mining tasks other than summarization. Here we describe three important properties
that a compressor may have.
Dataset-level Compression At the highest level we need to be able to compare the
encoded size of different databases. The most trivial way to do so is by comparing
the total encoded size, L (
D
, M ), where we induce the MDL-optimal model M for
each
.
This property alone allows us to use compression as a 'black box': without paying
any attention to the contents of the models or how datasets are compressed, the MDL
principle can be used to select appropriate models for a given dataset. In fact, this
property does not even require datasets to consist of individual tuples that can be
distinguished, nor does it require models to consist of patterns.
D
Search WWH ::




Custom Search