Database Reference
In-Depth Information
Chapter 8
Mining and Using Sets of Patterns through
Compression
Matthijs van Leeuwen and Jilles Vreeken
Abstract In this chapter we describe how to successfully apply the MDL principle
to pattern mining. In particular, we discuss how pattern-based models can be de-
signed and induced by means of compression, resulting in succinct and characteristic
descriptions of the data.
As motivation, we argue that traditional pattern mining asks the wrong question:
instead of asking for all patterns satisfying some interestingness measure, one should
ask for a small, non-redundant, and interesting set of patterns—which allows us
to avoid the pattern explosion. Firmly rooted in algorithmic information theory,
the approach we discuss in this chapter states that the best set of patterns is that
set that compresses the data best. We formalize this problem using the Minimum
Description Length (MDL) principle, describe useful model classes, and briefly
discuss algorithmic approaches to inducing good models from data. Last but not
least, we describe how the obtained models—in addition to showing the key patterns
of the data—can be used for a wide range of data mining tasks; hence showing that
MDL selects useful patterns.
Keywords Compression
·
MDL
·
Pattern set mining
·
Data summarization
1
Introduction
What is the ideal outcome of pattern mining? Which patterns would we really like
to find? Obviously, this depends on the task at hand, and possibly even on the user.
When we are exploring the data for new insights the ideal outcome will be different
than when the goal is to build a good pattern-based classifier.
There are, however, a few important general observations to be made. For starters,
we are not interested in patterns that describe noise—we only want patterns that
identify important associations in the data. In pattern mining, the function that usually
Search WWH ::




Custom Search