6.3.1 The Optimum
A more global issue is the efficiency of the encodings used. Whereas in Kolmogorov
complexity we have access to the ultimate algorithmic compressor, the MDL principle
assumes that we have access to the ultimate encoding. In practice, we have to make do
with an approximation. While we can make principled choices when constructing an
encoding, we often have to simplify matters to allow for fast(er) induction of good
models. For instance, in KRIMP it would be nice if we could encode transactions using
their exact probability given the pattern set. However, calculating the frequency of
an itemset given a set of itemsets and their frequencies is known to be PP-hard [61].
Hence, KRIMP uses an (admittedly crude) approximation of this ideal. A more efficient
encoding would allow us to detect more fine-grained redundancy, and hence lead to
smaller and better models. Currently, however, very little is known about how to
construct a good yet practical encoding.
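The flavor of such a crude but practical approximation can be sketched as follows: each itemset in the model receives a code whose length is the negative log of its relative usage in the cover of the data, and a transaction's encoded length is the sum of the code lengths of the itemsets covering it. This is a minimal illustration in the spirit of KRIMP's usage-based codes, not its full encoding; the function names and the toy usage counts are our own.

```python
import math

def code_length(usage, total_usage):
    """Shannon-optimal code length, in bits, for an itemset whose
    code is used `usage` times out of `total_usage` in the cover."""
    return -math.log2(usage / total_usage)

def encoded_length(cover, usages):
    """Bits needed to encode one transaction via the itemsets in
    `cover`, given usage counts for all itemsets in the model."""
    total = sum(usages.values())
    return sum(code_length(usages[itemset], total) for itemset in cover)

# Hypothetical model: itemsets with their usage counts over the data.
usages = {
    frozenset({'a', 'b'}): 6,   # used in 6 of 12 total code usages
    frozenset({'c'}): 3,
    frozenset({'d'}): 3,
}
# Transaction {a, b, c}, covered by {a,b} and {c}:
bits = encoded_length([frozenset({'a', 'b'}), frozenset({'c'})], usages)
# -log2(6/12) + -log2(3/12) = 1 + 2 = 3 bits
```

Note that the code lengths depend only on usage counts, not on the joint distribution the itemsets induce, which is exactly the simplification that sidesteps the PP-hard inference problem.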
A second global issue we need to point out is that of complexity. Intuitively,
optimizing an MDL score is rather complex. However, so far we have hardness results
only for a simple encoding in Boolean matrix factorization [46]. It may be that other
encodings exhibit structure that we have not yet identified, but which may be exploited
for (more) efficient search. Moreover, so far we have no theoretical results on the
quality of our greedy approximations. It may be possible to construct non-trivial
MDL scores that exhibit sub-modularity, which would allow us to bound the quality
of the greedy strategy.
Third, assuming for now that the optimization problem is hard and that there are no
(useful) approximation guarantees, we need to develop smart heuristics. We described
the two main approaches proposed so far: candidate filtering and direct mining.
Naively, the larger the part of the search space we consider, the better the model M
we will be able to find. However, as the model space is too large to explore
exhaustively, we have to find ways of efficiently deciding which candidates are good.
The direct mining approach provides a promising direction, but it is only as good as
the quality estimation it employs. Improving this estimation will allow us to prune
away more candidates and to concentrate our effort where it matters most.
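The candidate-filtering approach can be sketched as a simple greedy loop: consider candidates one by one, tentatively add each to the model, and keep it only if the total encoded length decreases. The sketch below uses a toy stand-in scoring function (`toy_length`), not an actual MDL encoding; all names here are illustrative assumptions.

```python
def filter_candidates(candidates, data, total_length, base_model):
    """Greedy candidate filtering: accept a candidate only if it
    lowers the total encoded length of model plus data."""
    model = list(base_model)
    best = total_length(model, data)
    for cand in candidates:          # e.g. ordered by length/support
        trial = model + [cand]
        score = total_length(trial, data)
        if score < best:             # keep only improving candidates
            model, best = trial, score
    return model, best

def toy_length(model, data):
    """Stand-in score: a fixed cost per itemset in the model, plus
    one unit per item left uncovered in each transaction."""
    covered = set().union(*model) if model else set()
    return 2 * len(model) + sum(len(t - covered) for t in data)

data = [{'a', 'b', 'c'}, {'a', 'b'}, {'c', 'd'}]
model, score = filter_candidates(
    [{'a', 'b'}, {'e'}, {'c'}], data, toy_length, base_model=[])
# {'a','b'} improves the score and is kept; {'e'} and {'c'} do not.
```

The weakness the text points out is visible even here: every candidate triggers a full re-scoring against the data, so the quality of the candidate order and of any cheaper pre-filtering estimate determines how much work is wasted.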
7 Conclusions
We discussed how to apply the MDL principle for mining sets of patterns that are both
informative and useful. In particular, we showed how pattern-based models can be
designed and selected by means of compression, giving us succinct and characteristic
descriptions of the data.
Firmly rooted in algorithmic information theory, the approach taken in this chapter
states that the best set of patterns is the set that compresses the data best. We
formalized this problem using MDL, described model classes that can be used to this
end, and briefly discussed algorithmic approaches to inducing good models from
data. Last but not least, we described how the obtained models, which are highly
characteristic of the data, can be used for numerous data mining tasks, making the
pattern sets practically useful.