Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining

Database Reference

In-Depth Information

2.1

Kolmogorov Complexity

Kolmogorov complexity measures the information content of a string s ; note that

any database

can be serialized into a string. The Kolmogorov complexity of s ,

K U ( s ), is defined as the length in bits of the shortest program p for a Universal Turing

machine

D

U

that generates s and then halts. Formally, we have

K U ( s )

=

min

x |

p

|

.

p :

U

( p )

=

Intuitively, program p can be regarded as the ultimate compressor of s .

Let us analyze what this entails. First of all, it is easy to see that every string s has

at least one program that generates it: the program p 0 that simply prints s verbatim.

Further, we know that if the string is fully random, there will be no shorter program

than p 0 . This gives us an upper bound. In fact, this allows us to define what structure

is, and what not. Namely, any (subset of) the data for which K ( s ) is smaller than the

length of p 0 exhibits structure—and the program p is the shortest description of this

structure.

Loosely speaking, the lower bound for K is zero, which will only be approximated

when the data s is very simple to express algorithmically. Examples include a long

series of one value, e.g., 000000000 ... , but also data that seems complex at first

glance, such as the first n digits of π , or a fractal, have in fact a very low Kolmogorov

complexity—which matches the intuition that, while the result may be complex, the

process for generating this data can indeed be relatively simple.

In fact, we can regard p as two parts; the 'algorithm' that describes the compress-

ible structure of s , and the 'input' to this algorithm that express the incompressible

part of s . Separating these two components in a given dataset is exactly the goal of

exploratory data analysis, and as such Kolmogorov Complexity institutes the ideal.

Sadly, however, K ( s ) is not computable. Apart from the fact that the space of possible

programs is enormous, we face the problem that p has to generate s and then halt.

By the halting problem we are unable to make that call.

This does not mean Kolmogorov complexity is useless. Quite the contrary, in

fact. While beyond the scope of this chapter, it provides the theoretical foundations

to many aspects of data analysis, statistics, data mining, and machine learning. We

refer the interested reader to Li and Vitány [ 40 ] for a detailed discussion on these

foundations.

Although Kolmogorov complexity itself is not computable, we can still put it

to practice by approximating it. With p we have the ultimate compressor, which

can exploit any structure present in s . The incomputability of K ( s ) stems from this

infinite 'vocabulary', as we have to consider all possible programs. We can, however,

constrain the family of programs we consider to a set for which we know they halt, by

limiting this vocabulary to a fixed set of regularities. In other words, by considering

lossless compression algorithms.

Frequent Pattern Mining

Search WWH ::

Custom Search

Home