MDL allows us to unambiguously identify the best set of patterns as the set that
provides the best lossless compression of the data. This gives us a means to
mine small sets of patterns that describe the distribution of the data very well: if the
pattern set at hand contains a pattern that describes noise, or one that is redundant
with respect to the rest, removing it from the set will improve compression. As such,
the MDL optimal pattern set automatically balances the quality of fit of the data with
the complexity of the model—without the user having to set any parameters, as all
we have to do is minimize the encoding cost.
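As a rough sketch, anticipating the formal treatment in Sect. 2, the crude two-part form of MDL selects the model that minimizes the total encoded size of model and data,
\[
M^{*} = \arg\min_{M \in \mathcal{M}} \bigl( L(M) + L(D \mid M) \bigr),
\]
where \(\mathcal{M}\) denotes the class of models under consideration (here, pattern sets), \(L(M)\) is the length in bits of the description of model \(M\), and \(L(D \mid M)\) is the length in bits of the data encoded with \(M\).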
In this chapter we will give an overview of how MDL, or compression, can be
used for mining informative pattern sets, as well as how to use these patterns
in a wide range of data mining tasks.
In a nutshell, we first discuss the necessary theoretical foundations in Sect. 2. In
Sect. 3 we then use this theory to discuss constructing pattern-based models we can
use with MDL. Section 4 covers the main approaches for mining good pattern sets,
and in Sect. 5 we discuss a range of data mining tasks that pattern-based compression
solves. We discuss open challenges in Sect. 6, and conclude in Sect. 7.
2 Foundations
Before we go into the specifics of MDL for pattern mining, we will have to discuss
some foundational theory.
Above, we stated that intuitively our goal is to find patterns that describe interesting
structure in the data, while avoiding patterns that overfit, i.e., that describe noise.
This raises the question: what is significant structure, and where does structure stop
and noise begin?
Statistics offers a wide range of tests to determine whether a result is significant,
including via Bayesian approaches such as calculating confidence intervals, as well
as frequentist approaches such as significance testing [17]. Loosely speaking, these
require us to assume a background distribution or null hypothesis, and use different
machinery to evaluate how likely the observed structure is under this assumption.
While this is a highly successful approach for confirming findings in science, in our
exploratory setting it raises three serious problems. First and foremost, there are
no off-the-shelf distributions for data and patterns that we can test against. Second,
even if we could, by assuming a distribution we strongly influence which results
will be deemed significant—a wrong choice will lead to meaningless results. Third,
the choice for the significance or confidence thresholds is arbitrary, yet strongly
influences the outcome. We want to avoid such far-reaching choices.
These problems were acknowledged by Ray Solomonoff, Andrey Kolmogorov,
and Gregory Chaitin, who independently invented and contributed to what is now
known as algorithmic information theory [12]. In a nutshell, instead of using the
probability under a distribution, in algorithmic information theory we consider the
algorithmic complexity of the data. That is, we measure the amount of information
the data contains by the amount of algorithmic 'effort' required to generate the
data using a universal Turing machine. There are different ways of formalizing such
'effort'. Here, we focus on Kolmogorov complexity.
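Loosely, and following the standard definition, the Kolmogorov complexity of the data, regarded as a binary string \(x\), is the length of a shortest program that outputs \(x\) on a universal Turing machine \(U\) and then halts,
\[
K_{U}(x) = \min \{ |p| : U(p) = x \},
\]
where \(p\) ranges over programs for \(U\) and \(|p|\) denotes the length of \(p\) in bits.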