the optimal code length for a given symbol in a string is equal to the negative log of the
probability of observing it in the string [12].
As such, an alternate interpretation of MDL is to interpret L(D | M) as the
(negative) log-likelihood of the data under the model, −log Pr(D | M), and to regard
L(M) as the negative log-likelihood of the model, −log Pr(M), or as a regularization
function. Hence, looking for the model that gives the best compression is similar to
looking for the maximum likelihood model under a budget. As such it has a similar
shape to Akaike's Information Criterion (AIC) [ 2 ] and the Bayesian Information
Criterion (BIC) [ 53 ]. This of course assumes that there is a distribution for models,
as well as that we have a generative model for data that can be parameterized by M .
This is often not the case.
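To make the parallel concrete, the small Python sketch below scores a toy Bernoulli model of a binary sequence both by a crude two-part MDL total and by BIC and AIC. All function names are made up for illustration, and the (k/2) · log2(n) bits charged per parameter are a common rough choice rather than the refined codes discussed here.

    import math

    def neg_log_lik_bits(data, p):
        # L(D | M): negative log-likelihood of binary data under a
        # Bernoulli(p) model, measured in bits.
        return -sum(math.log2(p if x == 1 else 1.0 - p) for x in data)

    def two_part_mdl(data, p, k=1):
        # Crude two-part score L(M) + L(D | M); L(M) is approximated as
        # (k / 2) * log2(n) bits per parameter (a rough, illustrative choice).
        return (k / 2) * math.log2(len(data)) + neg_log_lik_bits(data, p)

    def bic(data, p, k=1):
        # BIC = k * ln(n) - 2 * ln(likelihood), in nats.
        log_lik = -neg_log_lik_bits(data, p) * math.log(2)
        return k * math.log(len(data)) - 2 * log_lik

    def aic(data, p, k=1):
        # AIC = 2k - 2 * ln(likelihood), in nats.
        log_lik = -neg_log_lik_bits(data, p) * math.log(2)
        return 2 * k - 2 * log_lik

    data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]    # toy binary dataset
    p_hat = sum(data) / len(data)            # maximum-likelihood estimate
    print(two_part_mdl(data, p_hat), bic(data, p_hat), aic(data, p_hat))

All three scores trade goodness of fit against model complexity; they differ in how the complexity term is justified and, for MDL, in that every bit of the model itself must be written down.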
In MDL, however, we are concerned with descriptive models—not necessarily
generative ones. As such, and different from Bayesian learning, in both Kolmogorov
complexity and MDL we evaluate only the data and the explicit model at hand—we
do not 'average' over all models, so to speak, and hence do not need access to a
generative model. Moreover, MDL is different in that it requires a complete, lossless
encoding of both the model and the data while BIC and AIC penalize models based
only on the number of parameters.
In practice, while (refined) MDL and BIC are asymptotically the same, the two
may differ (strongly) on finite data samples. Typically, MDL is a bit more conserva-
tive. For a detailed discussion on the differences between BIC and MDL we refer to
Grünwald [ 20 ].
Using MDL in Practice. To use MDL in practice, one has to define the model class
M, how a single model M ∈ M describes a database, and how all of this is encoded
in bits. That is, we have to define a compression scheme. In addition, we need an
algorithm to mine—or approximate—the optimal model.
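As a minimal illustration of what such a compression scheme involves, the sketch below defines a toy model class (a set of substring patterns), a lossless encoding (a greedy left-to-right cover of the data with fixed-length code words), and the resulting total L(M) + L(D | M). The model class, the cover strategy, and the cost assignments are all illustrative assumptions, not the scheme of any particular published method.

    import math

    ALPHABET = 26  # assume the data uses lowercase letters only

    def cover(data, patterns):
        # Describe the string greedily from left to right: use the longest
        # matching pattern in the model, else fall back to a single symbol.
        tokens, i = [], 0
        pats = sorted(patterns, key=len, reverse=True)
        while i < len(data):
            for p in pats:
                if data.startswith(p, i):
                    tokens.append(p)
                    i += len(p)
                    break
            else:
                tokens.append(data[i])
                i += 1
        return tokens

    def total_length(data, patterns):
        # Two-part total L(M) + L(D | M): the model pays log2(ALPHABET) bits
        # per symbol it spells out, the data pays one fixed-length code word
        # per token of the cover. Crude, but every bit is accounted for, so
        # the encoding is lossless.
        n_codes = ALPHABET + len(patterns)
        l_model = sum(len(p) for p in patterns) * math.log2(ALPHABET)
        l_data = len(cover(data, patterns)) * math.log2(n_codes)
        return l_model + l_data

    data = "abcabcabcabxabcabc"
    print(total_length(data, set()))      # no patterns: encode symbol by symbol
    print(total_length(data, {"abc"}))    # "abc" recurs, so paying for it pays off

Changing either the model class or the encoding changes which models come out best, which is exactly the kind of design choice discussed next.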
A key advantage of MDL is that it removes the need for user-defined parameters:
the best model minimizes the total encoded size. Unfortunately, there are also disad-
vantages: (1) contrary to Kolmogorov Complexity, a model class needs to be defined
in advance, and (2) finding the optimal model is often practically infeasible. Conse-
quently, important design choices have to be made, and this is one of the challenges
of the compression-based approach to exploratory data mining.
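Since exhaustively scoring all models is usually out of reach, a heuristic search is typically used. The skeleton below sketches one very simple greedy strategy under that assumption: candidates are considered once, in a fixed order, and kept only if they reduce the total encoded size. The function and parameter names are hypothetical.

    def greedy_mdl_search(candidates, initial_model, extend, score):
        # Greedy heuristic: keep a candidate only if extending the model with
        # it lowers the total encoded size L(M) + L(D | M) reported by `score`.
        model, best = initial_model, score(initial_model)
        for cand in candidates:
            trial = extend(model, cand)
            trial_score = score(trial)
            if trial_score < best:        # accept: compression improved
                model, best = trial, trial_score
        return model, best

    # Driven by the toy scheme sketched earlier, a call could look like:
    #   model, bits = greedy_mdl_search(
    #       candidates=["abc", "abx", "bc"],
    #       initial_model=set(),
    #       extend=lambda m, c: m | {c},
    #       score=lambda m: total_length("abcabcabcabxabcabc", m))

The candidate order, the acceptance rule, and whether accepted elements can later be pruned again are precisely the design choices referred to above.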
A standard question regarding the use of MDL concerns the requirement of a
lossless encoding: if the goal is to find very short descriptions, why not use a lossy
encoding? The answer is two-fold.
First, and foremost, lossless encoding ensures fair comparison between models:
we know that every model is evaluated based on how well it describes the complete
dataset. With lossy compression, this is not the case: two models could have the same
L(D, M)—one describing only a small part of the data in high detail, and the other
describing all the data in low detail—and unlike for lossless compression, we would
have no (principled) way of choosing which one is best.
Second, we should point out that compression is not the goal, but only
a means to select the best model. By MDL, the best model provides the shortest
description of the data D.