The quantity (i − k) is the number of buckets between when the pattern was added and the present, so the pruning rule amounts to saying that a pattern must occur on average once per bucket.
If and only if a pattern occurs on average once per bucket, LOSSY COUNTING will record it and not prune it. The rate of occurrence of a pattern is its support, and since the bucket size is chosen to be 1/ε, the minimum support level for not being pruned is exactly ε, the error rate. The converse rule is that if a pattern's average support is less than once per bucket, it will NOT be recorded. This is the "loss" in lossy counting and the foundation for the true positive guarantee: the estimated support undercounts the true support by no more than ε.
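The pruning rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction of Manku and Motwani's LOSSY COUNTING, not their exact implementation; the entry format (count, delta) records the slack from the bucket in which each pattern was first inserted:

```python
import math

def lossy_count(stream, epsilon):
    # Illustrative sketch of LOSSY COUNTING (not the authors' code).
    # Entries map a pattern to (count, delta), where
    # delta = (bucket id at insertion) - 1.
    width = math.ceil(1.0 / epsilon)   # bucket size is 1/epsilon
    entries = {}
    bucket = 1
    for i, item in enumerate(stream, start=1):
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            entries[item] = (1, bucket - 1)
        if i % width == 0:
            # Bucket boundary: prune patterns that do not average
            # one occurrence per bucket, i.e. keep only entries
            # with count + delta > current bucket id.
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
            bucket += 1
    return entries
```

With ε = 0.1 the bucket width is 10; a pattern that fails to appear roughly once per bucket is dropped at the next boundary, and any surviving count undercounts the true count by at most εn.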
The main problem with LOSSY COUNTING is that it must record a relatively large amount of data. For example, suppose that θ = 0.10 and ε = 0.01. In order to guarantee that every reported frequent pattern has an actual support of at least 10 % − 1 % = 9 %, LOSSY COUNTING would need to remember every pattern that occurs with only 1 % support. There may be orders of magnitude more patterns that satisfy a support level of ε than a support level of θ. Hence, numerous works have striven to reduce this burden by using more sophisticated schemes that achieve a good error rate without imposing as large a memory requirement.
FPDM, A True Negative Algorithm
In response to Manku and Motwani's work,
Yu et al. propose an approach that takes a much different tack [77]. Their algorithm does not allow false positives, and it has a high probability of finding itemsets that are truly frequent. In particular, they use a user-defined parameter δ to control the probability of finding frequent itemsets that satisfy support level θ. Specifically, the result set does not include any itemset whose frequency is less than θ, whereas it includes any θ-frequent itemset with probability at least 1 − δ. It utilizes the Chernoff bound to achieve this probabilistic quality guarantee.
We can model the appearance of an itemset as a binomial random variable, meaning that the pattern either appears or does not in each transaction. Our support threshold θ serves as the variable's expected value. If in n trials the actual number of appearances is f, then the Chernoff bound states that
Pr{ |f/n − θ| ≥ ε } ≤ 2e^(−nε²/(2θ))        (9.2)
The term on the left side of the inequality is the probability that the observed rate
of occurrences will differ from the expected rate by at least a given error threshold.
We define a parameter
δ
to be equal to the term on the right side of the inequality.
That is, it is our target confidence level that our observed support is ε-close to the true support. Then, by rearranging,
ε_n = √( 2θ ln(2/δ) / n )        (9.3)
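Equation 9.3 is straightforward to evaluate numerically. The following is a small sketch (the function name chernoff_error is our own, not from the original work):

```python
import math

def chernoff_error(n, theta, delta):
    # Error bound eps_n from Eq. 9.3: with probability at least
    # 1 - delta, the observed support deviates from the true
    # support theta by less than eps_n after n transactions.
    return math.sqrt(2.0 * theta * math.log(2.0 / delta) / n)
```

For example, with θ = 0.10, δ = 0.01, and n = 100,000 transactions, ε_n comes out to roughly 0.0033.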
Equation
9.3
expresses the mutual dependence between these several parameters.
For a fixed support threshold θ, the error ε_n between the true and observed support levels diminishes as n increases. The confidence parameter δ enters only logarithmically: demanding a smaller δ (higher confidence) enlarges ε_n for a fixed n, or, equivalently, requires a larger number of trials to reach the same error bound.
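This dependence can also be read in the other direction: rearranging Eq. 9.3 for n gives the number of transactions needed before the observed support can be trusted to a given error at a given confidence. A sketch under those definitions (the function name is ours):

```python
import math

def transactions_needed(theta, delta, eps):
    # Rearranging Eq. 9.3 for n: the smallest number of trials
    # for which the error bound eps_n drops to eps or below.
    return math.ceil(2.0 * theta * math.log(2.0 / delta) / eps ** 2)
```

Because ε appears squared in the denominator, halving the tolerated error quadruples the required n, while tightening δ costs only a logarithmic factor.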