The quantity (i − k) is the number of buckets between when the pattern was added and the present, so the pruning rule amounts to saying that a pattern must occur on average once per bucket.
If and only if a pattern occurs on average once per bucket, LOSSY COUNTING will record it and not prune it. The rate of occurrence of a pattern is its support, and since the bucket size is chosen to be 1/ε, the minimum support level for not being pruned is exactly ε, the error rate. The converse rule is that if a pattern's average support is less than once per bucket, it will NOT be recorded. This is the "loss" in lossy counting and the foundation for the true positive guarantee: the estimated support undercounts the true support by no more than ε.
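The pruning rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction of Manku and Motwani's LOSSY COUNTING, not their exact implementation; the entry format (count, delta) records the slack from the bucket in which each pattern was first inserted:

```python
import math

def lossy_count(stream, epsilon):
    # Illustrative sketch of LOSSY COUNTING (not the authors' code).
    # Entries map a pattern to (count, delta), where
    # delta = (bucket id at insertion) - 1.
    width = math.ceil(1.0 / epsilon)   # bucket size is 1/epsilon
    entries = {}
    bucket = 1
    for i, item in enumerate(stream, start=1):
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            entries[item] = (1, bucket - 1)
        if i % width == 0:
            # Bucket boundary: prune patterns that do not average
            # one occurrence per bucket, i.e. keep only entries
            # with count + delta > current bucket id.
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
            bucket += 1
    return entries
```

With ε = 0.1 the bucket width is 10; a pattern that fails to appear roughly once per bucket is dropped at the next boundary, and any surviving count undercounts the true count by at most εn.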
The main problem with LOSSY COUNTING is that it must record a relatively large amount of data. For example, suppose that θ = 0.10 and ε = 0.01. In order to guarantee that every reported frequent pattern has an actual support of at least 10 % − 1 % = 9 %, LOSSY COUNTING would need to remember every pattern that occurs with only 1 % support. There may be orders of magnitude more patterns that satisfy a support level of ε than a support level of θ. Hence, numerous works have striven to reduce this burden by using more sophisticated schemes that achieve a good error rate without imposing as large a memory requirement.
FPDM, A True Negative Algorithm
In response to Manku and Motwani's work,
Yu et al. propose an approach that takes a much different tack [77]. Their algorithm does not allow false positives, and it has a high probability of finding itemsets that are truly frequent. In particular, they use a user-defined parameter δ to control the probability of finding frequent itemsets that satisfy support level θ. Specifically, the result set does not include any itemset whose frequency is less than θ, whereas it includes any θ-frequent itemset with probability at least 1 − δ. It utilizes the Chernoff bound to achieve this probabilistic quality guarantee.
We can model the appearance of an itemset as a binomial random variable, meaning that the pattern either appears or does not in each transaction. Our support threshold θ serves as the variable's expected value. If in n trials the actual number of appearances is f, then the Chernoff bound states that
Pr{ |f/n − θ| ≥ ε } ≤ 2e^(−nε²/(2θ))        (9.2)
The term on the left side of the inequality is the probability that the observed rate
of occurrences will differ from the expected rate by at least a given error threshold.
We define a parameter
δ
to be equal to the term on the right side of the inequality.
That is, it is our target confidence level that our observed support is ε-close to the true support. Then, by rearranging,
ε_n = √( 2θ ln(2/δ) / n )        (9.3)
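Equation 9.3 is straightforward to evaluate numerically. The following is a small sketch (the function name chernoff_error is our own, not from the original work):

```python
import math

def chernoff_error(n, theta, delta):
    # Error bound eps_n from Eq. 9.3: with probability at least
    # 1 - delta, the observed support deviates from the true
    # support theta by less than eps_n after n transactions.
    return math.sqrt(2.0 * theta * math.log(2.0 / delta) / n)
```

For example, with θ = 0.10, δ = 0.01, and n = 100,000 transactions, ε_n comes out to roughly 0.0033.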
Equation
9.3
expresses the mutual dependence between these several parameters.
For a fixed support threshold θ, the error ε_n between the true and observed support levels diminishes as n increases. The confidence parameter δ enters only logarithmically: demanding a smaller δ (higher confidence) enlarges ε_n for a fixed n, or, equivalently, requires a larger number of trials to reach the same error bound.
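This dependence can also be read in the other direction: rearranging Eq. 9.3 for n gives the number of transactions needed before the observed support can be trusted to a given error at a given confidence. A sketch under those definitions (the function name is ours):

```python
import math

def transactions_needed(theta, delta, eps):
    # Rearranging Eq. 9.3 for n: the smallest number of trials
    # for which the error bound eps_n drops to eps or below.
    return math.ceil(2.0 * theta * math.log(2.0 / delta) / eps ** 2)
```

Because ε appears squared in the denominator, halving the tolerated error quadruples the required n, while tightening δ costs only a logarithmic factor.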