8.3. Statistical Measures of Complexity
The basic problem with algorithmic complexity and its extensions is that they are all about finding the shortest way of exactly describing a single configuration. Even if we could compute these measures, we might suspect, on the basis of our discussion of over-fitting in §2 above, that this is not what we want. Many of the details of any particular set of data are just noise, and will not generalize to other data sets obtained from the same system. If we want to characterize the complexity of the system, it is precisely the generalizations that we want, and not the noisy particulars. Looking at the sophistication, we saw the idea of picking out, from the overall description, the part which describes the regularities of the data. This idea becomes useful and operational when we abandon the goal of exact description, and allow ourselves to recognize that the world is full of noise, which is easy to describe statistically; we want a statistical, and not an algorithmic, measure of complexity.
I will begin with what is undoubtedly the most widely used statistical measure of complexity, Rissanen's stochastic complexity, which can also be considered a method of model selection. Then I will look at three attempts to isolate the complexity of the system as such, by considering how much information would be required to predict its behavior, if we had an optimal statistical model of the system.
8.3.1. Stochastic Complexity and the Minimum Description Length
Suppose we have a statistical model with some parameter θ, and we observe the data x. The model assigns a certain likelihood to the data, Pr_θ(X = x). One can make this into a loss function by taking its negative logarithm: L(x, θ) = −log Pr_θ(X = x). Maximum likelihood estimation minimizes this loss function. We also learned, in §7, that if Pr_θ is the correct probability distribution, the optimal coding scheme will use −log Pr_θ(X = x) bits to encode x. Thus, maximizing the likelihood can also be thought of as minimizing the encoded length of the data.
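To make the likelihood-as-code-length correspondence concrete, here is a minimal sketch in Python, assuming a Bernoulli model of a binary sequence; the function name code_length_bits is hypothetical, not from the text:

    import math

    def code_length_bits(theta, x):
        # Length in bits of an optimal code for x built from Pr_theta,
        # i.e., the negative base-2 log-likelihood of the data.
        ones = sum(x)
        zeros = len(x) - ones
        return -(ones * math.log2(theta) + zeros * math.log2(1 - theta))

    x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 7 ones in 10 observations
    for theta in (0.5, 0.7, 0.9):
        print(theta, round(code_length_bits(theta, x), 3))

Running this shows that the maximum-likelihood value θ = 0.7 gives the shortest encoded length (about 8.8 bits, versus 10 bits at θ = 0.5), which is exactly the equivalence between likelihood maximization and code-length minimization described above.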
However, we do not yet have a complete description: we have an encoded version of the data, but we have not said what the encoding scheme, i.e., the model, is. Thus, the total description length has two parts:

C(x, θ, Θ) = L(x, θ) + D(θ, Θ),   [56]
where D(θ, Θ) is the number of bits we need to specify θ from among the set of all our models Θ. L(x, θ) represents the "noisy" or arbitrary part of the description, the one which will not generalize; the model represents the part which does generalize. If D(θ, Θ) gives short codes to simple models, we have the desired kind of tradeoff, where we can reduce the part of the data that looks like noise only at the cost of a longer description of the model.
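The tradeoff in equation [56] can be illustrated with a toy two-part code for the same Bernoulli model: suppose, purely as an illustrative assumption, that we encode θ by rounding the maximum-likelihood estimate to k binary digits, so that D(θ, Θ) = k bits. Finer quantization (larger k) shortens L(x, θ) but lengthens the model part, and minimizing the total selects an intermediate precision:

    from math import log2

    def two_part_mdl(x, max_bits=12):
        # Minimize C(x, theta, Theta) = L(x, theta) + D(theta, Theta),
        # where theta is the MLE rounded to k bits and D = k.
        # (A deliberately crude model code, for illustration only.)
        ones, n = sum(x), len(x)
        mle = ones / n
        best = None
        for k in range(1, max_bits + 1):
            theta = round(mle * 2**k) / 2**k
            theta = min(max(theta, 2**-k), 1 - 2**-k)  # keep theta in (0, 1)
            L = -(ones * log2(theta) + (n - ones) * log2(1 - theta))
            if best is None or k + L < best[0]:
                best = (k + L, k, theta)
        return best  # (total bits, model bits, quantized theta)

    print(two_part_mdl([1, 1, 0, 1, 0, 1, 1, 0, 1, 1]))

With these ten observations the minimum falls at a coarse quantization (k = 2, θ = 0.75): the data are too few to justify spending more bits on the model. Rissanen's actual coding of the parameters is considerably more refined; this sketch only shows the shape of the tradeoff.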