Interesting Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

each transaction t is a subset of I . The pattern language

consists of itemsets, i.e.,

again sets of items. The support count of an itemset X in D is defined as the number

of transactions that contain X , i.e., supp D ( X )

=|{ t ∈ D | X ⊆ t }|

. We write fr D ( X )

to denote the relative support of X in D , i.e., fr D ( X )

supp D ( X ) /

.Wedonot

write D wherever clear from context.

The 'interestingness' predicate is a threshold on the support of the itemsets, the

minimal support: minsup . In other words, the task in frequent set mining is to

compute

{

∈ L |

supp D ( X )

≥

minsup

}

The itemsets in the result are called frequent itemsets.

Intuition The intuition behind this measure is simple: the more often an itemset

occurs in the data, the more interesting it is.

Frequent itemset mining was originally not a goal on itself, but merely a necessary

step in order to mine association rules [ 3 ]. There, the task is to discover rules X

→

Y ,

where X and Y are itemsets with X

∩

=∅

, such that when itemset X is a subset

ofarow t

t . Such

rules express associations, possibly correlations, and can hence be useful in many

applications. A main motivation was supermarket basket analysis, the idea being that

by advertising X , people will also buy more of Y .

The basic strategy for mining association rules is to first mine frequent patterns,

and then consider all partitionings of each frequent itemset Z into non-overlapping

subsets X and Y , to form candidate rules X → Y , while finally keeping only

those association rules that satisfy some quality threshold [ 3 ]. Though an interesting

research topic on itself, interestingness measures for association rules are beyond

the scope of this chapter. We refer the interested reader to the recent survey by Tew

et al. [ 69 ].

A Priori With a search space of 2 | I | patterns, the naive approach of evaluating

every pattern is infeasible. However, in 1994 it was discovered that support exhibits

monotonicity. That is, for two itemsets X and Y , we know

∈

D , X

⊂

t , with high confidence we will also see itemset Y

⊂

→

supp ( X )

≥

supp ( Y ),

which is known as the A Priori property [ 1 , 44 ], and allows for efficient search for

frequent itemsets over the lattice of all itemsets.

The A Priori algorithm was independently discovered by Agrawal and Srikant

[ 1 ], and by Mannila, Toivonen, and Verkamo [ 44 ]. It is a so-called candidate test

framework. Given a transaction database D over a set of items

and a support

threshold minsup, it first determines the set of singleton frequent itemsets

F 1

{

∈

supp ( i )

≥

minsup

}

. Then, given a set

F k of frequent patterns of length

C k + 1 of candidate frequent patterns of length k

k , we can construct the set

1, by

F k .We

then determine the supports of all candidates in one pass over the data, and obtain

F k + 1 by keeping only the candidates with supp ( X )

considering only itemsets that have all k sub-itemsets of length k included in

≥ minsup .

Frequent Pattern Mining

Search WWH ::

Custom Search

Home