1 Introduction
Without a doubt, pattern mining is one of the most important concepts in data mining.
In contrast to the traditional task of modeling data—where the goal is to describe all
of the data with one model—patterns describe only part of the data [27]. Of course,
many parts of the data, and hence many patterns, are not interesting at all. The goal
of pattern mining is to discover only those that are.
Which brings us to one of the core problems of pattern mining, and the topic of this
chapter: interestingness measures. Or, how to determine whether a given pattern is
interesting, and how to efficiently mine the interesting patterns from a given dataset.
In particular, we find many interesting research challenges in the combination of
these two problems.
Before we go into this dual problem, there is a key issue we have to address first:
interestingness is inherently subjective. That is, what is very interesting to one may
be nothing but a useless result to another. This holds both between different analysts
looking at the same data and between different databases, as well as between different
data mining tasks. As such, we know that our lunch will not be free: there is no single
general measure of interestingness that we can hope to formalize and that will satisfy
everyone. Instead, we will have to define task-specific interestingness measures.
Setting aside any difficulties in defining a measure that correctly identifies what we
find interesting, the second key problem is the exponentially large search space. That
is, there are exponentially many potentially interesting patterns. Naively evaluating
these one by one and only reporting those that meet the criteria is hence infeasible
for all but the most trivial of pattern languages [3]. As such, in addition to correctly
identifying what is interesting, ideally an interestingness measure also defines a
structured, easily traversable search space to find these patterns.
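To give a sense of scale (a worked illustration of our own, not a figure from the text): for the arguably simplest pattern language, itemsets over a set of n items, every non-empty subset of the items is a candidate pattern, so there are already 2^n − 1 candidates; for n = 100 items this amounts to roughly 1.3 × 10^30 patterns, far too many to ever evaluate one by one.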
A big breakthrough in this regard was made in 1994 with the discovery by Agrawal
and Srikant, and independently by Mannila, Toivonen, and Verkamo [1, 44], that the
frequency measure exhibits anti-monotonicity, a property frequently referred to as
the A Priori principle. In practice, this property allows us to prune very large parts of
the search space, making it feasible to mine frequent patterns from very large databases.
In subsequent years, many highly efficient algorithms to this end were proposed
[78, 76, 26] (see also Chaps. 2 and 3).
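To illustrate how the A Priori principle enables this pruning, the sketch below gives a minimal levelwise miner in Python. It is our own illustration rather than code from the cited papers, and the toy transaction database is hypothetical: candidates of size k are generated only from frequent itemsets of size k − 1, and any candidate with an infrequent subset is discarded before its support is ever counted.

from itertools import combinations

def support(itemset, transactions):
    # Number of transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent individual items.
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i]), transactions) >= min_support]
    result, k = list(frequent), 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (A Priori principle): a candidate with an infrequent
        # (k-1)-subset can never be frequent, so it is dropped unseen.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count only the surviving candidates against the data.
        frequent = [c for c in candidates
                    if support(c, transactions) >= min_support]
        result.extend(frequent)
        k += 1
    return result

# Hypothetical toy database of five transactions.
db = [frozenset(t) for t in
      ({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"})]
print([sorted(p) for p in apriori(db, min_support=2)])

Because an infrequent itemset is never extended, entire branches of the exponential search space are skipped without ever being counted against the data.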
Soon after the discovery of the A Priori principle, people found that frequency is
not a very good measure of interestingness. In particular, they ran into the so-called
'pattern explosion': while for strict thresholds only patterns expressing common
knowledge were discovered, for non-trivial thresholds the exponential space of
patterns meant that incredibly many patterns were returned as 'interesting', many
of which were merely variations of the same theme.
In the years since, many interestingness measures have been proposed in the literature
to tackle these problems; many are for specialized tasks, pattern types, or data types, but
we also find highly general frameworks that attempt to approximate the ideal (subjective)
interestingness measure. In this chapter we aim to give an overview of the work done
in these respects. We will discuss a broad range of interestingness measures, as well
as how we can define efficient algorithms for extracting such patterns from data. In
order to keep the chapter focused and succinct we will restrict ourselves to measures