6.3.1 The Optimum
A more global issue is the efficiency of the encodings used. Whereas in Kolmogorov
complexity we have access to the ultimate algorithmic compressor, the MDL principle
assumes that we have access to the ultimate encoding. In practice, we have to make do
with an approximation. While we can make principled choices when constructing an
encoding, we often have to simplify matters to allow for fast(er) induction of good
models. For instance, in KRIMP it would be nice if we could encode transactions using
their exact probability given the pattern set. However, calculating the frequency of
an itemset given a set of itemsets and their frequencies is known to be PP-hard [61].
Hence, KRIMP uses an (admittedly crude) approximation of this ideal. A more efficient
encoding would allow us to detect more fine-grained redundancy, and hence lead to
smaller and better models. Currently, however, very little is known about how to
construct a good yet practical encoding.
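The flavor of such a crude but practical approximation can be sketched as follows: each itemset in the model receives a code whose length is the negative log of its relative usage in the cover of the data, and a transaction's encoded length is the sum of the code lengths of the itemsets covering it. This is a minimal illustration in the spirit of KRIMP's usage-based codes, not its full encoding; the function names and the toy usage counts are our own.

```python
import math

def code_length(usage, total_usage):
    """Shannon-optimal code length, in bits, for an itemset whose
    code is used `usage` times out of `total_usage` in the cover."""
    return -math.log2(usage / total_usage)

def encoded_length(cover, usages):
    """Bits needed to encode one transaction via the itemsets in
    `cover`, given usage counts for all itemsets in the model."""
    total = sum(usages.values())
    return sum(code_length(usages[itemset], total) for itemset in cover)

# Hypothetical model: itemsets with their usage counts over the data.
usages = {
    frozenset({'a', 'b'}): 6,   # used in 6 of 12 total code usages
    frozenset({'c'}): 3,
    frozenset({'d'}): 3,
}
# Transaction {a, b, c}, covered by {a,b} and {c}:
bits = encoded_length([frozenset({'a', 'b'}), frozenset({'c'})], usages)
# -log2(6/12) + -log2(3/12) = 1 + 2 = 3 bits
```

Note that the code lengths depend only on usage counts, not on the joint distribution the itemsets induce, which is exactly the simplification that sidesteps the PP-hard inference problem.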
A second global issue we need to point out is that of complexity. Intuitively,
optimizing an MDL score is rather complex. However, so far we have hardness results
only for a simple encoding in Boolean matrix factorization [46]. It may be that other
encodings exhibit structure that we have not yet identified, but which may be exploited
for (more) efficient search. Moreover, so far we have no theoretical results on the
quality of our greedy approximations. It may be possible to construct non-trivial
MDL scores that exhibit sub-modularity, which would allow us to bound the quality
of the greedy strategy.
Third, assuming for now that the optimization problem is hard and that there are no
(useful) approximation guarantees, we need to develop smart heuristics. We described
the two main approaches proposed so far: candidate filtering and direct mining.
Naively, the larger the part of the search space we consider, the better the model M
we will be able to find. However, as the model space is too large to explore
exhaustively, we have to find ways of efficiently deciding which candidates are good.
The direct mining approach provides a promising direction, but it is only as good as
the quality estimation it employs. Improving this estimation will allow us to prune
away more candidates and to concentrate our effort where it matters most.
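The candidate-filtering approach can be sketched as a simple greedy loop: consider candidates one by one, tentatively add each to the model, and keep it only if the total encoded length decreases. The sketch below uses a toy stand-in scoring function (`toy_length`), not an actual MDL encoding; all names here are illustrative assumptions.

```python
def filter_candidates(candidates, data, total_length, base_model):
    """Greedy candidate filtering: accept a candidate only if it
    lowers the total encoded length of model plus data."""
    model = list(base_model)
    best = total_length(model, data)
    for cand in candidates:          # e.g. ordered by length/support
        trial = model + [cand]
        score = total_length(trial, data)
        if score < best:             # keep only improving candidates
            model, best = trial, score
    return model, best

def toy_length(model, data):
    """Stand-in score: a fixed cost per itemset in the model, plus
    one unit per item left uncovered in each transaction."""
    covered = set().union(*model) if model else set()
    return 2 * len(model) + sum(len(t - covered) for t in data)

data = [{'a', 'b', 'c'}, {'a', 'b'}, {'c', 'd'}]
model, score = filter_candidates(
    [{'a', 'b'}, {'e'}, {'c'}], data, toy_length, base_model=[])
# {'a','b'} improves the score and is kept; {'e'} and {'c'} do not.
```

The weakness the text points out is visible even here: every candidate triggers a full re-scoring against the data, so the quality of the candidate order and of any cheaper pre-filtering estimate determines how much work is wasted.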
7 Conclusions
We discussed how to apply the MDL principle for mining sets of patterns that are both
informative and useful. In particular, we showed how pattern-based models can be
designed and selected by means of compression, giving us succinct and characteristic
descriptions of the data.
Firmly rooted in algorithmic information theory, the approach taken in this chapter
states that the best set of patterns is the set that compresses the data best. We
formalized this problem using MDL, described model classes that can be used to this
end, and briefly discussed algorithmic approaches to inducing good models from
data. Last but not least, we described how the obtained models, which are highly
characteristic of the data, can be used for numerous data mining tasks, making the
pattern sets practically useful.