The quantity (i − k) is the number of buckets between when the pattern was added
and the present, so the pruning rule amounts to saying that a pattern must occur on
average once per bucket.
If and only if a pattern occurs on average once per bucket will LOSSY COUNTING
record it and not prune it. The rate of occurrence of a pattern is its support, and since
the bucket size is chosen to be 1/ε, the minimum support level for not being pruned
is exactly ε, the error rate. The converse rule is that if a pattern's average support
is less than once per bucket, it will NOT be recorded. This is the “loss” in lossy
counting and the foundation for the true positive guarantee: the estimated support
undercounts the true support by no more than ε.
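To make the pruning rule concrete, here is a minimal Python sketch of LOSSY COUNTING for single items (the chapter concerns itemsets, but the bucket arithmetic is the same); the class name and structure are illustrative, not the authors' reference implementation.

    from math import ceil

    class LossyCounter:
        """Sketch of LOSSY COUNTING for single items."""

        def __init__(self, epsilon):
            self.epsilon = epsilon
            self.width = ceil(1.0 / epsilon)    # bucket size = 1/epsilon
            self.n = 0                          # stream length so far
            self.entries = {}                   # item -> [count, delta]

        def add(self, item):
            self.n += 1
            bucket = ceil(self.n / self.width)  # id of the current bucket
            if item in self.entries:
                self.entries[item][0] += 1
            else:
                # delta = bucket - 1 bounds how often the item could have
                # occurred before it was first recorded
                self.entries[item] = [1, bucket - 1]
            if self.n % self.width == 0:        # bucket boundary: prune
                # drop entries averaging under one occurrence per bucket,
                # i.e. those with count + delta <= current bucket id
                self.entries = {k: v for k, v in self.entries.items()
                                if v[0] + v[1] > bucket}

        def frequent(self, theta):
            # report items clearing theta - epsilon; estimated counts
            # undercount true counts by at most epsilon * n
            cutoff = (theta - self.epsilon) * self.n
            return {k: v[0] for k, v in self.entries.items()
                    if v[0] >= cutoff}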
The main problem with LOSSY COUNTING is that it must record a relatively large
amount of data. For example, suppose that θ = 0.10 and ε = 0.01. In order to
guarantee that every reported frequent pattern has an actual support of at least
10% − 1% = 9%, LOSSY COUNTING would need to remember every pattern that occurs
with only 1% support. There may be orders of magnitude more patterns that satisfy
a support level of ε than a support level of θ. Hence, numerous works have striven to
reduce this burden by using more sophisticated schemes that achieve a good error
rate without imposing as large a memory requirement. A short usage sketch of the
LossyCounter class above illustrates these numbers.
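In the sketch below, the synthetic stream, the seed, and the item names are invented for the example.

    import random

    random.seed(7)
    lc = LossyCounter(epsilon=0.01)            # bucket size = 1/0.01 = 100
    # synthetic stream: one genuinely frequent item amid many rare ones
    for _ in range(100_000):
        if random.random() < 0.15:
            lc.add("hot")                      # true support around 15%
        else:
            lc.add(f"bg{random.randrange(5000)}")  # each far below 1%

    # anything reported at theta = 0.10 has true support >= 10% - 1% = 9%,
    # but the table must track every pattern near 1% support to guarantee it
    print(lc.frequent(theta=0.10))             # -> {'hot': count near 15000}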
FPDM, A True Negative Algorithm  In response to Manku and Motwani's work,
Yu et al. propose an algorithm which takes a much different tack [77]. Their algorithm
does not allow false positives, and has a high probability of finding itemsets which
are truly frequent. In particular, they use a user-defined parameter δ to control the
probability of finding frequent itemsets which satisfy support level θ. Specifically,
the result set does not include any itemsets whose frequency is less than θ, whereas
it includes any θ-frequent itemset with probability of at least 1 − δ. It utilizes the
Chernoff bound to achieve this probabilistic quality guarantee.
We can model the appearance of an itemset as a binomial random variable, meaning
that the pattern either appears or does not in each transaction. Our support
threshold θ serves as the variable's expected value. If in n trials the actual number
of appearances is f, then the Chernoff bound states that

    Pr{ |f/n − θ| ≥ ε } ≤ 2e^(−nε²/(2θ))        (9.2)
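As an informal sanity check on Equation 9.2, the following sketch compares the stated bound against the deviation rate observed by simulating Bernoulli(θ) transactions; the parameter values here are arbitrary choices for illustration.

    import math
    import random

    theta, n, eps = 0.10, 5_000, 0.01
    bound = 2 * math.exp(-n * eps**2 / (2 * theta))  # right side of (9.2)

    random.seed(1)
    trials, deviations = 2_000, 0
    for _ in range(trials):
        f = sum(random.random() < theta for _ in range(n))  # binomial draw
        if abs(f / n - theta) >= eps:
            deviations += 1

    # the observed deviation rate should fall below the (loose) bound
    print(f"bound = {bound:.3f}, observed = {deviations / trials:.3f}")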
The term on the left side of the inequality is the probability that the observed rate
of occurrence will differ from the expected rate by at least the error threshold ε.
We define a parameter δ to be equal to the term on the right side of the inequality;
that is, 1 − δ is our target confidence level that the observed support is ε-close to
the true support. Then, by rearranging,
    ε_n = √( 2θ ln(2/δ) / n )        (9.3)
Equation 9.3 expresses the mutual dependence among these parameters. For a fixed
support threshold θ, the error ε_n between the true and observed support levels
will diminish as n increases. The confidence parameter δ varies inversely with both
the error and the necessary number of trials: demanding a smaller δ (higher
confidence) requires either tolerating a larger error or observing more transactions.
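These dependencies are easy to tabulate. The sketch below (the helper names eps_bound and trials_needed are ours) evaluates Equation 9.3 for growing n and inverts it to find the number of observations needed for a target error.

    import math

    def eps_bound(n, theta, delta):
        """Error from Equation 9.3: eps_n = sqrt(2*theta*ln(2/delta) / n)."""
        return math.sqrt(2 * theta * math.log(2 / delta) / n)

    def trials_needed(eps, theta, delta):
        """Invert Equation 9.3: smallest n whose error bound is at most eps."""
        return math.ceil(2 * theta * math.log(2 / delta) / eps**2)

    theta, delta = 0.10, 0.05
    for n in (1_000, 10_000, 100_000):
        print(f"n = {n:>7}: eps_n = {eps_bound(n, theta, delta):.4f}")
    print("n for eps = 0.01:", trials_needed(0.01, theta, delta))  # ~7400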
 