Database Reference
In-Depth Information
2.1
Kolmogorov Complexity
Kolmogorov complexity measures the information content of a string s ; note that
any database
can be serialized into a string. The Kolmogorov complexity of s ,
K U ( s ), is defined as the length in bits of the shortest program p for a Universal Turing
machine
D
U
that generates s and then halts. Formally, we have
K U ( s )
=
min
x |
p
|
.
p :
U
( p )
=
Intuitively, program p can be regarded as the ultimate compressor of s .
Let us analyze what this entails. First of all, it is easy to see that every string s has
at least one program that generates it: the program p 0 that simply prints s verbatim.
Further, we know that if the string is fully random, there will be no shorter program
than p 0 . This gives us an upper bound. In fact, this allows us to define what structure
is, and what not. Namely, any (subset of) the data for which K ( s ) is smaller than the
length of p 0 exhibits structure—and the program p is the shortest description of this
structure.
Loosely speaking, the lower bound for K is zero, which will only be approximated
when the data s is very simple to express algorithmically. Examples include a long
series of one value, e.g., 000000000 ... , but also data that seems complex at first
glance, such as the first n digits of π , or a fractal, have in fact a very low Kolmogorov
complexity—which matches the intuition that, while the result may be complex, the
process for generating this data can indeed be relatively simple.
In fact, we can regard p as two parts; the 'algorithm' that describes the compress-
ible structure of s , and the 'input' to this algorithm that express the incompressible
part of s . Separating these two components in a given dataset is exactly the goal of
exploratory data analysis, and as such Kolmogorov Complexity institutes the ideal.
Sadly, however, K ( s ) is not computable. Apart from the fact that the space of possible
programs is enormous, we face the problem that p has to generate s and then halt.
By the halting problem we are unable to make that call.
This does not mean Kolmogorov complexity is useless. Quite the contrary, in
fact. While beyond the scope of this chapter, it provides the theoretical foundations
to many aspects of data analysis, statistics, data mining, and machine learning. We
refer the interested reader to Li and Vitány [ 40 ] for a detailed discussion on these
foundations.
Although Kolmogorov complexity itself is not computable, we can still put it
to practice by approximating it. With p we have the ultimate compressor, which
can exploit any structure present in s . The incomputability of K ( s ) stems from this
infinite 'vocabulary', as we have to consider all possible programs. We can, however,
constrain the family of programs we consider to a set for which we know they halt, by
limiting this vocabulary to a fixed set of regularities. In other words, by considering
lossless compression algorithms.
Search WWH ::




Custom Search