8.3. Statistical Measures of Complexity
The basic problem with algorithmic complexity and its extensions is that they are all about finding the shortest way of exactly describing a single configuration. Even if we could compute these measures, we might suspect, on the basis of our discussion of over-fitting in §2 above, that this is not what we want. Many of the details of any particular set of data are just noise, and will not generalize to other data sets obtained from the same system. If we want to characterize the complexity of the system, it is precisely the generalizations that we want, and not the noisy particulars. Looking at the sophistication, we saw the idea of picking out, from the overall description, the part which describes the regularities of the data. This idea becomes useful and operational when we abandon the goal of exact description, and allow ourselves to recognize that the world is full of noise, which is easy to describe statistically; we want a statistical, and not an algorithmic, measure of complexity.
I will begin with what is undoubtedly the most widely used statistical measure of complexity, Rissanen's stochastic complexity, which can also be considered a method of model selection. Then I will look at three attempts to isolate the complexity of the system as such, by considering how much information would be required to predict its behavior, if we had an optimal statistical model of the system.
8.3.1. Stochastic Complexity and the Minimum Description Length
Suppose we have a statistical model with some parameter θ, and we observe the data x. The model assigns a certain likelihood to the data, Pr_θ(X = x). One can make this into a loss function by taking its negative logarithm: L(x, θ) = −log Pr_θ(X = x). Maximum likelihood estimation minimizes this loss function. We also learned, in §7, that if Pr_θ is the correct probability distribution, the optimal coding scheme will use −log Pr_θ(X = x) bits to encode x. Thus, maximizing the likelihood can also be thought of as minimizing the encoded length of the data.
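To make the likelihood-as-code-length correspondence concrete, here is a minimal sketch in Python, assuming a Bernoulli model of a binary sequence; the function name code_length_bits is hypothetical, not from the text:

    import math

    def code_length_bits(theta, x):
        # Length in bits of an optimal code for x built from Pr_theta,
        # i.e., the negative base-2 log-likelihood of the data.
        ones = sum(x)
        zeros = len(x) - ones
        return -(ones * math.log2(theta) + zeros * math.log2(1 - theta))

    x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 7 ones in 10 observations
    for theta in (0.5, 0.7, 0.9):
        print(theta, round(code_length_bits(theta, x), 3))

Running this shows that the maximum-likelihood value θ = 0.7 gives the shortest encoded length (about 8.8 bits, versus 10 bits at θ = 0.5), which is exactly the equivalence between likelihood maximization and code-length minimization described above.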
However, we do not yet have a complete description: we have an encoded version of the data, but we have not said what the encoding scheme, i.e., the model, is. Thus, the total description length has two parts:

C(x, θ, Θ) = L(x, θ) + D(θ, Θ),   [56]
where D(θ, Θ) is the number of bits we need to specify θ from among the set of all our models Θ. L(x, θ) represents the "noisy" or arbitrary part of the description, the one which will not generalize; the model represents the part which does generalize. If D(θ, Θ) gives short codes to simple models, we have the desired kind of tradeoff, where we can reduce the part of the data that looks like noise only at the cost of a longer description of the model.
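The tradeoff in equation [56] can be illustrated with a toy two-part code for the same Bernoulli model: suppose, purely as an illustrative assumption, that we encode θ by rounding the maximum-likelihood estimate to k binary digits, so that D(θ, Θ) = k bits. Finer quantization (larger k) shortens L(x, θ) but lengthens the model part, and minimizing the total selects an intermediate precision:

    from math import log2

    def two_part_mdl(x, max_bits=12):
        # Minimize C(x, theta, Theta) = L(x, theta) + D(theta, Theta),
        # where theta is the MLE rounded to k bits and D = k.
        # (A deliberately crude model code, for illustration only.)
        ones, n = sum(x), len(x)
        mle = ones / n
        best = None
        for k in range(1, max_bits + 1):
            theta = round(mle * 2**k) / 2**k
            theta = min(max(theta, 2**-k), 1 - 2**-k)  # keep theta in (0, 1)
            L = -(ones * log2(theta) + (n - ones) * log2(1 - theta))
            if best is None or k + L < best[0]:
                best = (k + L, k, theta)
        return best  # (total bits, model bits, quantized theta)

    print(two_part_mdl([1, 1, 0, 1, 0, 1, 1, 0, 1, 1]))

With these ten observations the minimum falls at a coarse quantization (k = 2, θ = 0.75): the data are too few to justify spending more bits on the model. Rissanen's actual coding of the parameters is considerably more refined; this sketch only shows the shape of the tradeoff.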