groups: count-based and sketch-based. FREQUENT is a simple but effective count-based
algorithm which addresses the Top-k Frequent Items problem. The idea was originally
published in 1982 by Misra and Gries [66] and then was rediscovered in 2002 [25].
The algorithm maintains up to k counters. For each item x, if it has already been
assigned a counter, then increment its count. Otherwise, if fewer than k counters are
currently in use, then assign a counter to x with a value of 1. Otherwise, decrement all
the counts. Any counter whose count drops to 0 is deassigned.
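A minimal Python sketch of the FREQUENT update rule just described, assuming a plain dict as the pool of k counters (the function name and the stream representation are illustrative choices, not from the original papers):

def frequent(stream, k):
    """FREQUENT (Misra-Gries): maintain at most k candidate counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1          # x already holds a counter
        elif len(counters) < k:
            counters[x] = 1           # a free counter is available
        else:
            # All k counters are taken: decrement every count and
            # deassign any counter that drops to 0.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

The standard analysis shows that each stored count falls short of the true count by at most N/(k+1) for a stream of length N, so any item occurring more than N/(k+1) times still holds a counter at the end; a second pass over the data is needed if exact counts are required.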
LOSSY COUNTING, by Manku and Motwani [61], computes an approximate answer
to the query for all θ-frequent patterns. The data stream is processed in batches of
size B = ⌈1/ε⌉, where 0 < ε < 1. For the n-th batch, count all the incoming items and add
these to the counts from previous batches. However, any item whose count is now less
than n is dropped from memory. As a consequence, Lossy Counting guarantees that
it tracks all items with support s(x) ≥ ε, and it undercounts the actual occurrences
by no more than εN, so ε serves as an error parameter. We discuss this algorithm
in greater detail in the next section.
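A compact Python rendering of the batch version described above, assuming batch size B = ⌈1/ε⌉ with pruning at each batch boundary (this simplified per-batch pruning follows the description in the text rather than the bucket bookkeeping of the original paper):

import math

def lossy_counting(stream, epsilon):
    B = math.ceil(1 / epsilon)   # batch size
    counts = {}
    n = 0                        # batches completed so far
    for i, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if i % B == 0:           # end of the n-th batch
            n += 1
            # Drop every item whose count has fallen below n.
            for key in [key for key, c in counts.items() if c < n]:
                del counts[key]
    return counts

To answer a θ-frequent query over N items, report every tracked item whose count is at least (θ − ε)N; the εN undercount bound guarantees that no truly θ-frequent item is missed.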
Sketch-based algorithms use a set of hash functions to project the counts for
every individual item onto a matrix of counters. By using multiple independent hash
functions and recording each item arrival at several matrix locations, there is a high
probability that we can retrieve a close estimate of the true frequency of a given item.
COUNTMIN [24] uses d hash functions h_j and a matrix M with d rows of length w.
Each hash function maps an item to one of the w columns. When an item arrives, the
d different matrix elements are incremented. The estimated frequency for item x is
the minimum of its d corresponding count values: f̂(x) = min_{1≤j≤d} M[j, h_j(x)]. If
we set the sizes d = ⌈log(1/δ)⌉ and w = ⌈2/ε⌉, we guarantee that f̂(x) has an error of at
most εN with probability of at least 1 − δ. The COUNT sketch algorithm by Charikar
et al. [17] is similar. It uses an additional set of hash functions to decide whether to
increment or decrement, and the estimate is the median of the d values rather than
the minimum.
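The following Python sketch illustrates both structures side by side. The class names and the salted use of Python's built-in hash() are illustrative assumptions; the papers require pairwise-independent hash families, which hash() does not provide:

import random
import statistics

class CountMinSketch:
    """COUNTMIN: d rows of w counters, one hash function per row."""
    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.M = [[0] * w for _ in range(d)]
        self.salts = [rng.getrandbits(32) for _ in range(d)]  # per-row salts

    def _h(self, j, x):
        return hash((self.salts[j], x)) % self.w

    def add(self, x):
        for j in range(self.d):
            self.M[j][self._h(j, x)] += 1

    def estimate(self, x):
        # Minimum over the d counters that x maps to.
        return min(self.M[j][self._h(j, x)] for j in range(self.d))

class CountSketch(CountMinSketch):
    """COUNT sketch: adds a +/-1 sign hash; the estimate is a median."""
    def _sign(self, j, x):
        return 1 if hash((self.salts[j], "sign", x)) % 2 == 0 else -1

    def add(self, x):
        for j in range(self.d):
            self.M[j][self._h(j, x)] += self._sign(j, x)

    def estimate(self, x):
        return statistics.median(
            self._sign(j, x) * self.M[j][self._h(j, x)] for j in range(self.d)
        )

With d = ⌈log(1/δ)⌉ and w = ⌈2/ε⌉, as above, the COUNTMIN estimate exceeds the true frequency by at most εN with probability at least 1 − δ.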
3 Frequent Itemset Mining Algorithms
For static datasets, the classic method for finding frequent patterns is the Apriori
approach [2]. However, even for static data, the Apriori method is inefficient for
large datasets for two reasons: (1) it makes numerous passes over the data, and (2) it
starts with a potentially large number of small candidate itemsets. Due to the single-
pass constraint, most frequent pattern mining algorithms for streaming data settle
for an approximate set of frequent patterns. That is, we do not know for certain that
the result set is exactly equal to the true set of frequent patterns. The algorithms fall
into two categories: those that produce false positive results and those that produce
false negative results. A false positive algorithm guarantees that its result set includes
every true frequent pattern, but it may include some additional ones. A false negative
algorithm guarantees that every pattern it returns is frequent, but it may fail to detect
some true frequent ones.