1 Introduction
Frequent pattern mining is the search for frequently-occurring patterns within a
dataset. The dataset may be loosely structured, such as a set of text documents; semistructured, such as XML; or highly structured, such as a graph. Each type of
data may be the source of a different type of pattern. For example, for a dataset
of purchase transactions, each transaction contains an itemset, so we may look for
frequent (sub)itemsets. In fact, frequent itemset mining is the most common pattern
mining application. Other important patterns include subsequences, subtrees, and
subgraphs.
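To make the itemset setting concrete, the short Python sketch below counts itemset support by brute force over a handful of purchase transactions and keeps the itemsets that meet a minimum support threshold. The toy transactions, the frequent_itemsets helper, and the min_support and max_size parameters are illustrative assumptions, not drawn from any particular algorithm surveyed in this chapter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transaction dataset: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force support counting: count every candidate itemset of up to
    max_size items in each transaction and keep those whose count (support)
    meets the min_support threshold."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

# Example: itemsets appearing in at least 3 of the 5 transactions.
print(frequent_itemsets(transactions, min_support=3))
```

Even this tiny example hints at the cost of exhaustive enumeration: each transaction contributes a candidate for every one of its subsets, which is precisely the cost that practical mining algorithms must avoid.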
Not only do frequent patterns describe the highlights of a dataset, providing key insights into the data, but they also serve as building blocks for many other data mining and machine learning tasks, such as association rule mining, classification, clustering, and change detection [1, 36, 37, 38, 41, 51, 80].
The frequent itemset mining task gained wide attention in the data mining community with the publication of Agrawal and Srikant's Apriori algorithm [2] in 1994. The next year, the pattern space was extended from itemsets to sequential patterns in another seminal paper [3], also by Agrawal and Srikant. Since then, many efficient frequent pattern mining algorithms have been developed [4, 30, 32, 44, 76, 79, 78]. A popular survey of frequent pattern mining algorithms is [33]. These algorithms assume that the dataset is static, stored on disk, and that two or more passes over it may be taken.
In a streaming environment, however, a mining algorithm may take only a single pass over the data [8]. The aforementioned algorithms can at best guarantee an approximate result after one pass. Thus, the need arose for a new class of mining techniques.
Compared with other stream processing tasks, frequent pattern mining presents three computational challenges. First, there is generally an exponential number of patterns to consider. For example, if we are seeking subsequences, a sequence of length N contains 2^N possible subsequences. The classic Apriori-style algorithm evaluates O(k^2) candidate subpatterns in order to find one pattern of length k. However, if data are streaming in quickly, the computational complexity needs to be linear, or nearly so, in order to keep up with newly arriving data.
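For a concrete sense of this blow-up, the Python sketch below enumerates the subsequences of a short sequence by selecting subsets of positions and confirms the 2^N count for a length-5 sequence; the all_subsequences helper is purely illustrative and not part of any algorithm discussed here.

```python
from itertools import combinations

def all_subsequences(seq):
    """Yield every subsequence of seq by picking a subset of positions.
    A sequence of length N yields 2**N subsequences (including the empty one)."""
    for size in range(len(seq) + 1):
        for positions in combinations(range(len(seq)), size):
            yield tuple(seq[i] for i in positions)

seq = "abcde"                       # N = 5
subs = list(all_subsequences(seq))
print(len(subs), 2 ** len(seq))     # prints "32 32": already exponential at N = 5
```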
Second, the memory requirements can also be substantial. Because the search
space is so large, the answer set itself may also be very large. To make matters
worse, many stream mining algorithms produce approximate results biased towards
false positive selections, so as not to miss any true positive results. Hence, a naïve
streaming data algorithm could require more memory than a static data algorithm.
Therefore, the mining algorithm needs to be very memory-efficient.
Third, the algorithms must balance the need for accuracy against the need for efficiency.
Reducing the error of the approximate results usually requires expending more
memory and more computational time, with diminishing returns. A good mining
algorithm should allow the user to adjust the balance between accuracy and
computational resources.
In the last several years, researchers have introduced a number of new algorithms to find frequent patterns over data streams. In this chapter, we survey these algorithms.