Database Reference
In-Depth Information
2
Preliminaries
In this section, we define the general problem of frequent pattern mining in stream-
ing data. We discuss some common variations of the task and we present popular
approaches to the simplest variation: frequent item mining.
2.1
Frequent Pattern Mining: Definition
Many of the existing surveys or reviews of frequent pattern mining have focused
exclusively on frequent itemset mining. We aim to offer a broader coverage, including
sequences, trees, and graphs among the types of data and patterns to be considered.
We start by giving a formal definition of the Frequent Pattern Mining problem.
Let
be the set of all possible data items x i . A pattern P is
a sequence or set of data items, with
X
=
{
x 1 , x 2 , ... , x m }
P
being all the possible patterns of interest.
A streaming dataset
.
The sequence is of indefinite length. At a time j , the data window T i , j is the finite data
subsequence from some earlier time i to the present: T i , j
T
is a sequence of transacted patterns, i.e.,
T
=
{
T 1 , T 2 , T 3 , ...
}
.
Because smaller patterns may be subsets of larger patterns, any data window may
contain numerous patterns. Let Patt ( T ) be a subpattern enumeration function which
generates the multiset of all patterns contained within T . The support s of a subpattern
P in dataset T is the frequency of p within T :
={
T i , T i + 1 , T i + 2 , ... , T j }
count ( P , Patt ( T ))
|
s ( P )
=
(9.1)
T
|
where count ( P , Patt ( T )) is the number of times that pattern P occurs in multiset
Patt ( T ). Then, for a given support threshold θ ,0 <θ< 1, a pattern P is a frequent
pattern of T iff s ( P )
θ . The Frequent Pattern Mining in Streaming Data
Problem seeks to find the set of all θ -frequent patterns P
contained with a data
window T ij . A variant task seeks to find the Top- K Frequent Patterns , regardless
of support threshold.
This general model fits all the common types of patterns sought in streaming data:
itemsets, subsequences, subtrees, and subgraphs. Note that we have said that Patt
considers the whole data subsequence T i , j to enumerate subpatterns. However, in the
overwhelming majority of research works, we look for subpatterns only within each
individual data object. With this typical restriction, Patt ( T i , j )
P
T a
T i , j . When the patterns of interest themselves are subsequences, there are a few works
[ 16 , 72 ] which combine adjacent data objects from the data stream to form candidate
patterns.
Let us see how this model fits the the most common application, frequent itemset
mining. Each object is an itemset.
In the example in Table 9.1 , each data object in the stream is an itemset. The
patterns being sought are frequent (sub)itemsets contained within each data object.
For example, T 1
=∪ Patt ( T a )
={
A , B , D , E
}
and Patt ( T 1 )
={
( A , B ), ( A , D ), ( A , E ), ( B , D ),
Search WWH ::




Custom Search