Frequent Pattern Mining in Data Streams - Frequent Pattern Mining

Database Reference

In-Depth Information

Preliminaries

In this section, we define the general problem of frequent pattern mining in stream-

ing data. We discuss some common variations of the task and we present popular

approaches to the simplest variation: frequent item mining.

2.1

Frequent Pattern Mining: Definition

Many of the existing surveys or reviews of frequent pattern mining have focused

exclusively on frequent itemset mining. We aim to offer a broader coverage, including

sequences, trees, and graphs among the types of data and patterns to be considered.

We start by giving a formal definition of the Frequent Pattern Mining problem.

Let

be the set of all possible data items x i . A pattern P is

a sequence or set of data items, with

{

x 1 , x 2 , ... , x m }

being all the possible patterns of interest.

A streaming dataset

The sequence is of indefinite length. At a time j , the data window T i , j is the finite data

subsequence from some earlier time i to the present: T i , j

is a sequence of transacted patterns, i.e.,

{

T 1 , T 2 , T 3 , ...

}

Because smaller patterns may be subsets of larger patterns, any data window may

contain numerous patterns. Let Patt ( T ) be a subpattern enumeration function which

generates the multiset of all patterns contained within T . The support s of a subpattern

P in dataset T is the frequency of p within T :

T i , T i + 1 , T i + 2 , ... , T j }

count ( P , Patt ( T ))

s ( P )

(9.1)

where count ( P , Patt ( T )) is the number of times that pattern P occurs in multiset

Patt ( T ). Then, for a given support threshold θ ,0 <θ< 1, a pattern P is a frequent

pattern of T iff s ( P )

θ . The Frequent Pattern Mining in Streaming Data

Problem seeks to find the set of all θ -frequent patterns P

≥

contained with a data

window T ij . A variant task seeks to find the Top- K Frequent Patterns , regardless

of support threshold.

This general model fits all the common types of patterns sought in streaming data:

itemsets, subsequences, subtrees, and subgraphs. Note that we have said that Patt

considers the whole data subsequence T i , j to enumerate subpatterns. However, in the

overwhelming majority of research works, we look for subpatterns only within each

individual data object. With this typical restriction, Patt ( T i , j )

∈ P

∀ T a ∈

T i , j . When the patterns of interest themselves are subsequences, there are a few works

[ 16 , 72 ] which combine adjacent data objects from the data stream to form candidate

patterns.

Let us see how this model fits the the most common application, frequent itemset

mining. Each object is an itemset.

In the example in Table 9.1 , each data object in the stream is an itemset. The

patterns being sought are frequent (sub)itemsets contained within each data object.

For example, T 1

=∪ Patt ( T a )

A , B , D , E

}

and Patt ( T 1 )

( A , B ), ( A , D ), ( A , E ), ( B , D ),

Frequent Pattern Mining

Search WWH ::

Custom Search

Home