Database Reference
In-Depth Information
Suppose item
e
is frequent, meaning that it occurs at least
Nθ
times in a dataset.
The KPS method randomly removes
unique
items from the dataset. This
eliminates at most one instance of
e
. We can remove at most
1
/θ
θ
such sets, with the
final set possibly being smaller than
. It follows that the final set must contain
e
. However, there is no way to detect which members of the result set are frequent
and which are not.
Jin and Agrawal enhance KPS to use a single pass and to implement an accuracy
bound analogous to Manku and Motwani's work [
61
]. Besides reporting all items
or itemsets that occur with frequency more than
Nθ
, they report only the items
or itemsets which appear with frequency at least
Nθ
(1
1
/θ
1.
Algorithm 7 (BOUNDEDCOUNT outlines their method. In the first stage, we invoke
the algorithm from Karp et al., with frequency level
θ
.
−
), where 0
<
≤
is the set of potentially
frequent items. We maintain a
count
for each item in the set
P
. This set is initially
empty. As the algorithm processes a new item from a sequence, we check if it is in
the set
P
. If yes, its count is incremented, otherwise, it is inserted with a count of
1. When the size of the set
P
P
becomes larger than
1
/θ
, we decrement the count
of each item in
, and eliminate any item whose count has now become 0. This
processing is equivalent to the removals we described earlier. We also record the
number of elimination rounds,
c
, that occur. Clearly,
c
P
≤
Nθ
. In the second step,
we remove all items whose reported frequency is less than
Nθ
−
≥
−
c
Nθ
(1
).