MDL allows us to unambiguously identify the best set of patterns as the set that
provides the best lossless compression of the data. This gives us a means to
mine small sets of patterns that describe the distribution of the data very well: if the
pattern set at hand contains a pattern that describes noise, or one that is redundant
with respect to the rest, removing it from the set will improve compression. As such,
the MDL optimal pattern set automatically balances the quality of fit of the data with
the complexity of the model—without the user having to set any parameters, as all
we have to do is minimize the encoding cost.
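As a rough sketch, anticipating the formal treatment in Sect. 2, the crude two-part form of MDL selects the model that minimizes the total encoded size of model and data,
\[
M^{*} = \arg\min_{M \in \mathcal{M}} \bigl( L(M) + L(D \mid M) \bigr),
\]
where \(\mathcal{M}\) denotes the class of models under consideration (here, pattern sets), \(L(M)\) is the length in bits of the description of model \(M\), and \(L(D \mid M)\) is the length in bits of the data encoded with \(M\).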
In this chapter we will give an overview of how MDL, or compression, can be
used for mining informative pattern sets, as well as how to use these patterns
in a wide range of data mining tasks.
In a nutshell, we first discuss the necessary theoretical foundations in Sect. 2. In
Sect. 3 we then use this theory to discuss constructing pattern-based models we can
use with MDL. Section 4 covers the main approaches for mining good pattern sets,
and in Sect. 5 we discuss a range of data mining tasks that pattern-based compression
solves. We discuss open challenges in Sect. 6, and conclude in Sect. 7.
2 Foundations
Before we go into the specifics of MDL for pattern mining, we will have to discuss
some foundational theory.
Above, we stated that intuitively our goal is to find patterns that describe interesting
structure in the data, while avoiding patterns that overfit, i.e., that describe noise.
This raises the question: what is significant structure, and where does structure stop
and noise begin?
Statistics offers a wide range of tests to determine whether a result is significant,
including via Bayesian approaches such as calculating confidence intervals, as well
as frequentist approaches such as significance testing [17]. Loosely speaking, these
require us to assume a background distribution or null hypothesis, and use different
machinery to evaluate how likely the observed structure is under this assumption.
While this is a highly successful approach for confirming findings in science, in our
exploratory setting it raises three serious problems. First and foremost, there are
no off-the-shelf distributions for data and patterns that we can test against. Second,
even if we could, by assuming a distribution we strongly influence which results
will be deemed significant—a wrong choice will lead to meaningless results. Third,
the choice for the significance or confidence thresholds is arbitrary, yet strongly
influences the outcome. We want to avoid such far-reaching choices.
These problems were acknowledged by Ray Solomonoff, Andrey Kolmogorov,
and Gregory Chaitin, who independently invented and contributed to what is now
known as algorithmic information theory [12]. In a nutshell, instead of using the
probability under a distribution, in algorithmic information theory we consider the
algorithmic complexity of the data. That is, we measure the amount of information
the data contains by the amount of algorithmic 'effort' required to generate the
data using a universal Turing machine. There are different ways of formalizing such
'effort'. Here, we focus on Kolmogorov complexity.
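Loosely, and following the standard definition, the Kolmogorov complexity of the data, regarded as a binary string \(x\), is the length of a shortest program that outputs \(x\) on a universal Turing machine \(U\) and then halts,
\[
K_{U}(x) = \min \{ |p| : U(p) = x \},
\]
where \(p\) ranges over programs for \(U\) and \(|p|\) denotes the length of \(p\) in bits.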