1 Introduction
Frequent pattern mining is the search for frequently-occurring patterns within a
dataset. The dataset may be loosely structured, such as a set of text documents; semistructured, such as XML; or highly structured, such as a graph. Each type of
data may be the source of a different type of pattern. For example, for a dataset
of purchase transactions, each transaction contains an itemset, so we may look for
frequent (sub)itemsets. In fact, frequent itemset mining is the most common pattern
mining application. Other important patterns include subsequences, subtrees, and
subgraphs.
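To make the itemset setting concrete, the short Python sketch below counts itemset support by brute force over a handful of purchase transactions and keeps the itemsets that meet a minimum support threshold. The toy transactions, the frequent_itemsets helper, and the min_support and max_size parameters are illustrative assumptions, not drawn from any particular algorithm surveyed in this chapter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transaction dataset: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force support counting: count every candidate itemset of up to
    max_size items in each transaction and keep those whose count (support)
    meets the min_support threshold."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

# Example: itemsets appearing in at least 3 of the 5 transactions.
print(frequent_itemsets(transactions, min_support=3))
```

Even this tiny example hints at the cost of exhaustive enumeration: each transaction contributes a candidate for every one of its subsets, which is precisely the cost that practical mining algorithms must avoid.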
Not only do frequent patterns describe the highlights of a dataset, providing key insights into the data, but they also serve as building blocks for many other data mining and machine learning tasks, such as association rule mining, classification, clustering, and change detection [1, 36, 37, 38, 41, 51, 80].
The frequent itemset mining task gained wide attention in the data mining community with the publication of Agrawal and Srikant's Apriori algorithm [2] in 1994. The next year, the pattern space was extended from itemsets to sequential patterns in another seminal paper [3], also by Agrawal and Srikant. Since then, many efficient frequent pattern mining algorithms have been developed [4, 30, 32, 44, 76, 79, 78]. A popular survey of frequent pattern mining algorithms is [33]. These algorithms assume that the dataset is static, stored on disk, and that two or more passes over it may be taken.
In a streaming environment, however, a mining algorithm may take only a single pass over the data [8]. The aforementioned algorithms can at best guarantee an approximate result after one pass. Thus, the need arose for a new class of mining techniques.
Compared with other stream processing tasks, frequent pattern mining presents three computational challenges. First, there is generally an exponential number of patterns to consider. For example, if we are seeking subsequences, a sequence of length N contains 2^N possible subsequences. The classic Apriori-style algorithm evaluates O(k^2) candidate subpatterns in order to find one pattern of length k. However, if data are streaming in quickly, the computational complexity needs to be linear, or nearly so, in order to keep up with newly arriving data.
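For a concrete sense of this blow-up, the Python sketch below enumerates the subsequences of a short sequence by selecting subsets of positions and confirms the 2^N count for a length-5 sequence; the all_subsequences helper is purely illustrative and not part of any algorithm discussed here.

```python
from itertools import combinations

def all_subsequences(seq):
    """Yield every subsequence of seq by picking a subset of positions.
    A sequence of length N yields 2**N subsequences (including the empty one)."""
    for size in range(len(seq) + 1):
        for positions in combinations(range(len(seq)), size):
            yield tuple(seq[i] for i in positions)

seq = "abcde"                       # N = 5
subs = list(all_subsequences(seq))
print(len(subs), 2 ** len(seq))     # prints "32 32": already exponential at N = 5
```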
Second, the memory requirements can also be substantial. Because the search
space is so large, the answer set itself may also be very large. To make matters
worse, many stream mining algorithms produce approximate results biased towards
false positive selections, so as not to miss any true positive results. Hence, a naïve
streaming data algorithm could require more memory than a static data algorithm.
Therefore, the mining algorithm needs to be very memory-efficient.
Third, the algorithms must balance the need for accuracy against the need for efficiency.
Reducing the error of the approximate results usually requires expending more
memory and more computational time, with diminishing returns. A good mining
algorithm should allow the user to adjust the balance between accuracy and
computational resources.
In the last several years, researchers have introduced a number of new algorithms to find frequent patterns over data streams. In this chapter, we survey these algorithms.