Frequent Itemsets - Mining of Massive Datasets

Database Reference

In-Depth Information

(a) Use the file that was collected while the first iteration of the algorithm was run-

ning. At the same time, collect yet another file to be used at another iteration of

the algorithm, when this current iteration finishes.

(b) Start collecting another file of baskets now, and run the algorithm when an ad-

equate number of baskets has been collected.

(2) We can continue to count the numbers of occurrences of each of these frequent item-

sets, along with the total number of baskets seen in the stream, since the counting star-

ted. If any itemset is discovered to occur in a fraction of the baskets that is signific-

antly below the threshold fraction s , then this set can be dropped from the collection of

frequent itemsets. When computing the fraction, it is important to include the occur-

rences from the original file of baskets, from which the frequent itemsets were derived.

If not, we run the risk that we shall encounter a short period in which a truly frequent

itemset does not appear sufficiently frequently and throw it out. We should also allow

some way for new frequent itemsets to be added to the current collection. Possibilities

include:

(a) Periodically gather a new segment of the baskets in the stream and use it as the

data file for another iteration of the chosen frequent-itemsets algorithm. The new

collection of frequent items is formed from the result of this iteration and the fre-

quent itemsets from the previous collection that have survived the possibility of

having been deleted for becoming infrequent.

(b) Add some random itemsets to the current collection, and count their fraction of

occurrences for a while, until one has a good idea of whether or not they are cur-

rently frequent. Rather than choosing new itemsets completely at random, one

might focus on sets with items that appear in many itemsets already known to be

frequent. For example, a good choice is to pick new itemsets from the negative

border ( Section 6.4.5 ) of the current set of frequent itemsets.

6.5.2

Frequent Itemsets in Decaying Windows

Recall from Section 4.7 that a decaying window on a stream is formed by picking a small

constant c and giving the i th element prior to the most recent element the weight (1− c ) i , or

approximately e − ci . Section 4.7.3 actually presented a method for computing the frequent

items, provided the support threshold is defined in a somewhat different way. That is, we

considered, for each item, a stream that had 1 if the item appeared at a certain stream ele-

ment and 0 if not. We defined the “score” for that item to be the sum of the weights where

its stream element was 1. We were constrained to record all items whose score was at least

1/2. We can not use a score threshold above 1, because we do not initiate a count for an

Search WWH ::

Custom Search

Home