Databases Reference
In-Depth Information
(a) Use the file that was collected while the first iteration of the algo-
rithm was running. At the same time, collect yet another file to be
used at another iteration of the algorithm, when this current itera-
tion finishes.
(b) Start collecting another file of baskets now, and run the algorithm
when an adequate number of baskets has been collected.
2. We can continue to count the numbers of occurrences of each of these
frequent itemsets, along with the total number of baskets seen in the
stream, since the counting started. If any itemset is discovered to occur
in a fraction of the baskets that is significantly below the threshold fraction
s, then this set can be dropped from the collection of frequent itemsets.
When computing the fraction, it is important to include the occurrences
from the original file of baskets, from which the frequent itemsets were
derived. If not, we run the risk that we shall encounter a short period in
which a truly frequent itemset does not appear su ciently frequently and
throw it out. We should also allow some way for new frequent itemsets
to be added to the current collection. Possibilities include:
(a) Periodically gather a new segment of the baskets in the stream and
use it as the data file for another iteration of the chosen frequent-
itemsets algorithm. The new collection of frequent items is formed
from the result of this iteration and the frequent itemsets from the
previous collection that have survived the possibility of having been
deleted for becoming infrequent.
(b) Add some random itemsets to the current collection, and count their
fraction of occurrences for a while, until one has a good idea of
whether or not they are currently frequent. Rather than choosing
new itemsets completely at random, one might focus on sets with
items that appear in many itemsets already known to be frequent.
For example, a good choice is to pick new itemsets from the negative
border (Section 6.4.5) of the current set of frequent itemsets.
6.5.2 Frequent Itemsets in Decaying Windows
Recall from Section 4.7 that a decaying window on a stream is formed by picking
a small constant c and giving the ith element prior to the most recent element
the weight (1−c) i , or approximately e −ci . Section 4.7.3 actually presented a
method for computing the frequent items, provided the support threshold is
defined in a somewhat different way. That is, we considered, for each item, a
stream that had 1 if the item appeared at a certain stream element and 0 if not.
We defined the “score” for that element to be the sum of the weights where
the stream element was 1. We were constrained to record all items whose score
was at least 1/2. We can not use a score threshold above 1, because we do not
initiate a count for an item until the item appears in the stream, and the first
Search WWH ::




Custom Search