Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

(a) Use the file that was collected while the first iteration of the algo-

rithm was running. At the same time, collect yet another file to be

used at another iteration of the algorithm, when this current itera-

tion finishes.

(b) Start collecting another file of baskets now, and run the algorithm

when an adequate number of baskets has been collected.

2. We can continue to count the numbers of occurrences of each of these

frequent itemsets, along with the total number of baskets seen in the

stream, since the counting started. If any itemset is discovered to occur

in a fraction of the baskets that is significantly below the threshold fraction

s, then this set can be dropped from the collection of frequent itemsets.

When computing the fraction, it is important to include the occurrences

from the original file of baskets, from which the frequent itemsets were

derived. If not, we run the risk that we shall encounter a short period in

which a truly frequent itemset does not appear su ciently frequently and

throw it out. We should also allow some way for new frequent itemsets

to be added to the current collection. Possibilities include:

(a) Periodically gather a new segment of the baskets in the stream and

use it as the data file for another iteration of the chosen frequent-

itemsets algorithm. The new collection of frequent items is formed

from the result of this iteration and the frequent itemsets from the

previous collection that have survived the possibility of having been

deleted for becoming infrequent.

(b) Add some random itemsets to the current collection, and count their

fraction of occurrences for a while, until one has a good idea of

whether or not they are currently frequent. Rather than choosing

new itemsets completely at random, one might focus on sets with

items that appear in many itemsets already known to be frequent.

For example, a good choice is to pick new itemsets from the negative

border (Section 6.4.5) of the current set of frequent itemsets.

6.5.2 Frequent Itemsets in Decaying Windows

Recall from Section 4.7 that a decaying window on a stream is formed by picking

a small constant c and giving the ith element prior to the most recent element

the weight (1−c) i , or approximately e −ci . Section 4.7.3 actually presented a

method for computing the frequent items, provided the support threshold is

defined in a somewhat different way. That is, we considered, for each item, a

stream that had 1 if the item appeared at a certain stream element and 0 if not.

We defined the “score” for that element to be the sum of the weights where

the stream element was 1. We were constrained to record all items whose score

was at least 1/2. We can not use a score threshold above 1, because we do not

initiate a count for an item until the item appears in the stream, and the first

Search WWH ::

Custom Search

Home