1 Introduction
Without a doubt, pattern mining is one of the most important concepts in data mining.
In contrast to the traditional task of modeling data—where the goal is to describe all
of the data with one model—patterns describe only part of the data [27]. Of course,
many parts of the data, and hence many patterns, are not interesting at all. The goal
of pattern mining is to discover only those that are.
Which brings us to one of the core problems of pattern mining, and the topic of this
chapter: interestingness measures. Or, how to determine whether a given pattern is
interesting, and how to efficiently mine the interesting patterns from a given dataset.
In particular, we find many interesting research challenges in the combination of
these two problems.
Before we go into this dual problem, there is a key issue we have to address first:
interestingness is inherently subjective. That is, what is very interesting to one may
be nothing but a useless result to another. This holds both between different analysts
looking at the same data and between different databases, as well as between different
data mining tasks. As such, we know that our lunch will not be free: there is no single
general measure of interestingness that we can hope to formalize and that will satisfy
everyone. Instead, we will have to define task-specific interestingness measures.
Setting aside any difficulties in defining a measure that correctly identifies what we
find interesting, the second key problem is the exponentially large search space. That
is, there are exponentially many potentially interesting patterns. Naively evaluating
these one by one and only reporting those that meet the criteria is hence infeasible
for all but the most trivial of pattern languages [3]. As such, in addition to correctly
identifying what is interesting, ideally an interestingness measure also defines a
structured, easily traversable search space to find these patterns.
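To give a sense of scale (a worked illustration of our own, not a figure from the text): for the arguably simplest pattern language, itemsets over a set of n items, every non-empty subset of the items is a candidate pattern, so there are already 2^n − 1 candidates; for n = 100 items this amounts to roughly 1.3 × 10^30 patterns, far too many to ever evaluate one by one.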
A big breakthrough in this regard was made in 1994 with the discovery by Agrawal
and Srikant, and independently by Mannila, Toivonen, and Verkamo [1, 44], that the
frequency measure exhibits anti-monotonicity, a property frequently referred to as
the A Priori principle. In practice, this property allows us to prune very large parts of
the search space, making it feasible to mine frequent patterns from very large databases.
In subsequent years, many highly efficient algorithms to this end were proposed
[78, 76, 26] (see also Chaps. 2 and 3).
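To illustrate how the A Priori principle enables this pruning, the sketch below gives a minimal levelwise miner in Python. It is our own illustration rather than code from the cited papers, and the toy transaction database is hypothetical: candidates of size k are generated only from frequent itemsets of size k − 1, and any candidate with an infrequent subset is discarded before its support is ever counted.

from itertools import combinations

def support(itemset, transactions):
    # Number of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent individual items.
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i]), transactions) >= min_support]
    result, k = list(frequent), 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (A Priori principle): a candidate with an infrequent
        # (k-1)-subset can never be frequent, so it is dropped unseen.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count only the surviving candidates against the data.
        frequent = [c for c in candidates
                    if support(c, transactions) >= min_support]
        result.extend(frequent)
        k += 1
    return result

# Hypothetical toy database of five transactions.
db = [frozenset(t) for t in
      ({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"})]
print([sorted(p) for p in apriori(db, min_support=2)])

Because an infrequent itemset is never extended, entire branches of the exponential search space are skipped without ever being counted against the data.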
Soon after the discovery of the A Priori principle, people found that frequency is
not a very good measure of interestingness. In particular, they ran into the so-called
'pattern explosion': while for strict thresholds only patterns expressing common
knowledge were discovered, for non-trivial thresholds the exponential space of
patterns meant that incredibly many patterns were returned as 'interesting', many
of which were merely variations of the same theme.
In the years since, many interestingness measures have been proposed in the literature
to tackle these problems; many are for specialized tasks, pattern types, or data types, but
we also find highly general frameworks that attempt to approximate the ideal (subjective)
interestingness measure. In this chapter we aim to give an overview of the work done
in these respects. We will discuss a broad range of interestingness measures, as well
as how we can define efficient algorithms for extracting such patterns from data. In
order to keep the chapter focused and succinct we will restrict ourselves to measures