us whether the resulting set of frequent patterns will be exactly correct, or may include false positives, false negatives, or both. The last column provides the authors' names for their algorithms and data structures.
In the following, we examine a range of algorithms to see the benefits and trade-offs involved for the different types of time windows, update intervals, and accuracy guarantees.
3.1 Mining the Full Data Stream
In the most basic version of frequent itemset mining, we seek to identify every
itemset that occurs with a support level greater than θ , across the full history of the
data stream. However, if we want exact results and if we are to consider the complete
history, then it is necessary to record every arriving pattern, either directly or in some
compressed format. If we fail to record even a few infrequent pattern occurrences,
then we have miscounted. If a pattern later becomes more frequent, then the few
occurrences that we skipped could make the difference between exceeding the θ threshold and falling short of it. However, counting every itemset can easily exceed the available memory. Therefore, several approximation techniques have been developed.
The approximation algorithms generate a result set: a set of itemsets that may or may not be exactly equal to the set of all itemsets whose support level exceeds θ. The algorithms
fall into two categories: those that produce false positive results and those that produce
false negative results. A false positive algorithm guarantees that its result set includes
every true frequent pattern, but it may include some additional ones. A false negative
algorithm guarantees that every pattern it returns is frequent, but it may fail to detect
some true frequent ones.
Lossy Counting, A False Positive Algorithm Manku and Motwani proposed the
first one-pass algorithm, LOSSY COUNTING, to find all frequent itemsets over a data
stream [61]. Many of the algorithms developed since then still use the basic idea
behind lossy counting. Their algorithm is false positive oriented in the sense that it
does not allow false negatives, and it has a provable bound on false positives. It uses
a user-defined error parameter ε to control the quality of the result set for a given support level θ. More precisely, its result set is guaranteed to contain all itemsets whose frequency exceeds θ, and it contains no itemsets whose true frequency is less than θ − ε. In other words, itemsets whose frequency lies between θ − ε and θ may appear in the result set; these are the false positives.
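As a rough numeric illustration of this guarantee (the stream length and parameter values below are hypothetical, chosen only to make the thresholds concrete):

```python
# Hypothetical numbers illustrating the Lossy Counting guarantee:
# how an itemset's true count determines its possible membership
# in the result set.
N = 10_000        # stream length so far (assumed)
theta = 0.01      # support threshold: theta * N = 100 occurrences
eps = 0.001       # error parameter, eps << theta

def classify(true_count):
    """What the guarantee says about an itemset with this true count."""
    if true_count >= theta * N:
        return "always reported"      # true frequent pattern
    if true_count >= (theta - eps) * N:
        return "possibly reported"    # potential false positive
    return "never reported"           # below the (theta - eps) floor

print(classify(150))  # at least 100 occurrences: always reported
print(classify(95))   # between 90 and 100: possibly reported
print(classify(80))   # below 90: never reported
```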
The algorithm maintains a prefix tree T of potentially frequent patterns. As data
are streaming in as part of the k-th bucket, every pattern is recorded. Patterns are recorded in tuple form ⟨f(p), err(p)⟩, where f(p) is the number of occurrences of pattern p since its inclusion in T, and err(p) = k − 1. This error is the number of buckets that have passed prior to the pattern being added to T. If a pattern is not yet in the tree, a new tuple is created. If a pattern is already in the tree, then f(p) is incremented.
The tree is pruned at the conclusion of each bucket. A pattern is deleted from
bucket i if
k
f ( p ) <i
err ( p ). Recall that err ( p ) relates to the bucket during which
the pattern was first added to T . In other words, a pattern is pruned if
f ( p ) <i k +
1.
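A minimal sketch of this bookkeeping, simplified to a stream of single items in a plain dictionary (the published algorithm batches transactions and keeps itemsets in the prefix tree T, but the ⟨f(p), err(p)⟩ tuples and the per-bucket pruning rule are the same idea):

```python
import math

def lossy_counting(stream, theta, eps):
    """One-pass Lossy Counting sketch over a stream of single items.

    Simplification of the itemset algorithm: a dict stands in for the
    prefix tree T, and each tracked item maps to a (f, err) tuple.
    """
    w = math.ceil(1 / eps)   # bucket width
    T = {}                   # item -> (f, err)
    n = 0
    for x in stream:
        n += 1
        k = math.ceil(n / w)             # current bucket number
        if x in T:
            f, err = T[x]
            T[x] = (f + 1, err)          # already tracked: increment f
        else:
            T[x] = (1, k - 1)            # new entry: err(p) = k - 1
        if n % w == 0:                   # end of bucket k: prune
            # delete p when f(p) <= k - err(p), i.e. f + err <= k
            T = {p: fe for p, fe in T.items() if fe[0] + fe[1] > k}
    # report items whose maintained count clears (theta - eps) * n
    return {p for p, (f, _) in T.items() if f >= (theta - eps) * n}
```

For example, on a stream of 100 items where 'a' occurs 60 times and 'b' and 'c' occur 20 times each, calling `lossy_counting(stream, 0.5, 0.1)` returns only {'a'}, since only 'a' clears the (θ − ε) · n = 40 occurrence floor.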