groups: count-based and sketch-based. FREQUENT is a simple but effective count-based
algorithm which addresses the Top-k Frequent Items problem. The idea was originally
published in 1982 by Misra and Gries [66] and then was rediscovered in 2002 [25].
The algorithm maintains up to k counters. For each item x, if it has already been
assigned a counter, then increment its count. Otherwise, if fewer than k counters are
currently in use, then assign a counter to x with a value of 1. Otherwise, decrement all
the counts. Any counter whose count drops to 0 is deassigned.
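A minimal Python sketch of the FREQUENT update rule just described, assuming a plain dict as the pool of k counters (the function name and the stream representation are illustrative choices, not from the original papers):

def frequent(stream, k):
    """FREQUENT (Misra-Gries): maintain at most k candidate counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1          # x already holds a counter
        elif len(counters) < k:
            counters[x] = 1           # a free counter is available
        else:
            # All k counters are taken: decrement every count and
            # deassign any counter that drops to 0.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

The standard analysis shows that each stored count falls short of the true count by at most N/(k+1) for a stream of length N, so any item occurring more than N/(k+1) times still holds a counter at the end; a second pass over the data is needed if exact counts are required.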
LOSSY COUNTING, by Manku and Motwani [61], computes an approximate answer
to the query for all θ-frequent patterns. The data stream is processed in batches of
size B = ⌈1/ε⌉, where 0 < ε < 1. For the n-th batch, count all the incoming items and add
these to the counts from previous batches. However, any item whose count is now less
than n is dropped from memory. As a consequence, Lossy Counting guarantees that
it tracks all items with support s(x) ≥ ε, and it undercounts the actual occurrences
by no more than εN, so ε serves as an error parameter. We discuss this algorithm
in greater detail in the next section.
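A compact Python rendering of the batch version described above, assuming batch size B = ⌈1/ε⌉ with pruning at each batch boundary (this simplified per-batch pruning follows the description in the text rather than the bucket bookkeeping of the original paper):

import math

def lossy_counting(stream, epsilon):
    B = math.ceil(1 / epsilon)   # batch size
    counts = {}
    n = 0                        # batches completed so far
    for i, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if i % B == 0:           # end of the n-th batch
            n += 1
            # Drop every item whose count has fallen below n.
            for key in [key for key, c in counts.items() if c < n]:
                del counts[key]
    return counts

To answer a θ-frequent query over N items, report every tracked item whose count is at least (θ − ε)N; the εN undercount bound guarantees that no truly θ-frequent item is missed.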
Sketch-based algorithms use a set of hash functions to project the counts for
every individual item onto a matrix of counters. By using multiple independent hash
functions and recording each item arrival at several matrix locations, there is a high
probability that we can retrieve a close estimate of the true frequency of a given item.
COUNTMIN [24] uses d hash functions h_j and a matrix M with d rows of length w.
Each hash function maps an item to one of the w columns. When an item arrives, the
d different matrix elements are incremented. The estimated frequency for item x is
the minimum of its d corresponding count values: f̂(x) = min_{1≤j≤d} M[j, h_j(x)]. If
we set the sizes d = ⌈log(1/δ)⌉ and w = ⌈2/ε⌉, we guarantee that f̂(x) has an error of at
most εN with probability of at least 1 − δ. The COUNT sketch algorithm by Charikar
et al. [17] is similar. It uses an additional set of hash functions to decide whether to
increment or decrement, and the estimate is the median of the d values rather than
the minimum.
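The following Python sketch illustrates both structures side by side. The class names and the salted use of Python's built-in hash() are illustrative assumptions; the papers require pairwise-independent hash families, which hash() does not provide:

import random
import statistics

class CountMinSketch:
    """COUNTMIN: d rows of w counters, one hash function per row."""
    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.M = [[0] * w for _ in range(d)]
        self.salts = [rng.getrandbits(32) for _ in range(d)]  # per-row salts

    def _h(self, j, x):
        return hash((self.salts[j], x)) % self.w

    def add(self, x):
        for j in range(self.d):
            self.M[j][self._h(j, x)] += 1

    def estimate(self, x):
        # Minimum over the d counters that x maps to.
        return min(self.M[j][self._h(j, x)] for j in range(self.d))

class CountSketch(CountMinSketch):
    """COUNT sketch: adds a +/-1 sign hash; the estimate is a median."""
    def _sign(self, j, x):
        return 1 if hash((self.salts[j], "sign", x)) % 2 == 0 else -1

    def add(self, x):
        for j in range(self.d):
            self.M[j][self._h(j, x)] += self._sign(j, x)

    def estimate(self, x):
        return statistics.median(
            self._sign(j, x) * self.M[j][self._h(j, x)] for j in range(self.d)
        )

With d = ⌈log(1/δ)⌉ and w = ⌈2/ε⌉, as above, the COUNTMIN estimate exceeds the true frequency by at most εN with probability at least 1 − δ.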
3 Frequent Itemset Mining Algorithms
For static datasets, the classic method for finding frequent patterns is the Apriori
approach [2]. However, even for static data, the Apriori method is inefficient for
large datasets for two reasons: (1) it makes numerous passes over the data, and (2) it
starts with a potentially large number of small candidate itemsets. Due to the single-
pass constraint, most frequent pattern mining algorithms for streaming data settle
for an approximate set of frequent patterns. That is, we do not know for certain that
the result set is exactly equal to the true set of frequent patterns. The algorithms fall
into two categories: those that produce false positive results and those that produce
false negative results. A false positive algorithm guarantees that its result set includes
every true frequent pattern, but it may include some additional ones. A false negative
algorithm guarantees that every pattern it returns is frequent, but it may fail to detect
some true frequent ones.