the optimal code length for a given symbol in a string is equal to the negative log of the
probability of observing it in the string [12].
As such, an alternate interpretation of MDL is to interpret L(D | M) as the
(negative) log-likelihood of the data under the model, −log Pr(D | M), and to regard
L(M) as the negative log-likelihood of the model, −log Pr(M), or as a regularization
function. Hence, looking for the model that gives the best compression is similar to
looking for the maximum likelihood model under a budget. As such it has a similar
shape to Akaike's Information Criterion (AIC) [ 2 ] and the Bayesian Information
Criterion (BIC) [ 53 ]. This of course assumes that there is a distribution for models,
as well as that we have a generative model for data that can be parameterized by M .
This is often not the case.
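To make the parallel concrete, the small Python sketch below scores a toy Bernoulli model of a binary sequence both by a crude two-part MDL total and by BIC and AIC. All function names are made up for illustration, and the (k/2) · log2(n) bits charged per parameter are a common rough choice rather than the refined codes discussed here.

    import math

    def neg_log_lik_bits(data, p):
        # L(D | M): negative log-likelihood of binary data under a
        # Bernoulli(p) model, measured in bits.
        return -sum(math.log2(p if x == 1 else 1.0 - p) for x in data)

    def two_part_mdl(data, p, k=1):
        # Crude two-part score L(M) + L(D | M); L(M) is approximated as
        # (k / 2) * log2(n) bits per parameter (a rough, illustrative choice).
        return (k / 2) * math.log2(len(data)) + neg_log_lik_bits(data, p)

    def bic(data, p, k=1):
        # BIC = k * ln(n) - 2 * ln(likelihood), in nats.
        log_lik = -neg_log_lik_bits(data, p) * math.log(2)
        return k * math.log(len(data)) - 2 * log_lik

    def aic(data, p, k=1):
        # AIC = 2k - 2 * ln(likelihood), in nats.
        log_lik = -neg_log_lik_bits(data, p) * math.log(2)
        return 2 * k - 2 * log_lik

    data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]    # toy binary dataset
    p_hat = sum(data) / len(data)            # maximum-likelihood estimate
    print(two_part_mdl(data, p_hat), bic(data, p_hat), aic(data, p_hat))

All three scores trade goodness of fit against model complexity; they differ in how the complexity term is justified and, for MDL, in that every bit of the model itself must be written down.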
In MDL, however, we are concerned with descriptive models—not necessarily
generative ones. As such, and different from Bayesian learning, in both Kolmogorov
complexity and MDL we evaluate only the data and the explicit model at hand—we
do not 'average' over all models, so to speak, and hence do not need access to a
generative model. Moreover, MDL is different in that it requires a complete, lossless
encoding of both the model and the data while BIC and AIC penalize models based
only on the number of parameters.
In practice, while (refined) MDL and BIC are asymptotically the same, the two
may differ (strongly) on finite data samples. Typically, MDL is a bit more conserva-
tive. For a detailed discussion on the differences between BIC and MDL we refer to
Grünwald [ 20 ].
Using MDL in Practice. To use MDL in practice, one has to define the model class
M, how a single model M ∈ M describes a database, and how all of this is encoded
in bits. That is, we have to define a compression scheme. In addition, we need an
algorithm to mine—or approximate—the optimal model.
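As a minimal illustration of what such a compression scheme involves, the sketch below defines a toy model class (a set of substring patterns), a lossless encoding (a greedy left-to-right cover of the data with fixed-length code words), and the resulting total L(M) + L(D | M). The model class, the cover strategy, and the cost assignments are all illustrative assumptions, not the scheme of any particular published method.

    import math

    ALPHABET = 26  # assume the data uses lowercase letters only

    def cover(data, patterns):
        # Describe the string greedily from left to right: use the longest
        # matching pattern in the model, else fall back to a single symbol.
        tokens, i = [], 0
        pats = sorted(patterns, key=len, reverse=True)
        while i < len(data):
            for p in pats:
                if data.startswith(p, i):
                    tokens.append(p)
                    i += len(p)
                    break
            else:
                tokens.append(data[i])
                i += 1
        return tokens

    def total_length(data, patterns):
        # Two-part total L(M) + L(D | M): the model pays log2(ALPHABET) bits
        # per symbol it spells out, the data pays one fixed-length code word
        # per token of the cover. Crude, but every bit is accounted for, so
        # the encoding is lossless.
        n_codes = ALPHABET + len(patterns)
        l_model = sum(len(p) for p in patterns) * math.log2(ALPHABET)
        l_data = len(cover(data, patterns)) * math.log2(n_codes)
        return l_model + l_data

    data = "abcabcabcabxabcabc"
    print(total_length(data, set()))      # no patterns: encode symbol by symbol
    print(total_length(data, {"abc"}))    # "abc" recurs, so paying for it pays off

Changing either the model class or the encoding changes which models come out best, which is exactly the kind of design choice discussed next.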
A key advantage of MDL is that it removes the need for user-defined parameters:
the best model minimizes the total encoded size. Unfortunately, there are also disad-
vantages: (1) contrary to Kolmogorov Complexity, a model class needs to be defined
in advance, and (2) finding the optimal model is often practically infeasible. Conse-
quently, important design choices have to be made, and this is one of the challenges
of the compression-based approach to exploratory data mining.
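Since exhaustively scoring all models is usually out of reach, a heuristic search is typically used. The skeleton below sketches one very simple greedy strategy under that assumption: candidates are considered once, in a fixed order, and kept only if they reduce the total encoded size. The function and parameter names are hypothetical.

    def greedy_mdl_search(candidates, initial_model, extend, score):
        # Greedy heuristic: keep a candidate only if extending the model with
        # it lowers the total encoded size L(M) + L(D | M) reported by `score`.
        model, best = initial_model, score(initial_model)
        for cand in candidates:
            trial = extend(model, cand)
            trial_score = score(trial)
            if trial_score < best:        # accept: compression improved
                model, best = trial, trial_score
        return model, best

    # Driven by the toy scheme sketched earlier, a call could look like:
    #   model, bits = greedy_mdl_search(
    #       candidates=["abc", "abx", "bc"],
    #       initial_model=set(),
    #       extend=lambda m, c: m | {c},
    #       score=lambda m: total_length("abcabcabcabxabcabc", m))

The candidate order, the acceptance rule, and whether accepted elements can later be pruned again are precisely the design choices referred to above.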
A standard question regarding the use of MDL concerns the requirement of a
lossless encoding: if the goal is to find very short descriptions, why not use a lossy
encoding? The answer is two-fold.
First, and foremost, lossless encoding ensures fair comparison between models:
we know that every model is evaluated based on how well it describes the complete
dataset. With lossy compression, this is not the case: two models could have the same
L(D, M)—one describing only a small part of the data in high detail, and the other
describing all the data in low detail—and unlike for lossless compression, we would
have no (principled) way of choosing which one is best.
Second, we should point out that compression is not the goal, but only
a means to select the best model. By MDL, the best model provides the shortest
description of the data D.