more main memory, then an option is to read the sample from disk for each pass. Since the
sample is much smaller than the full dataset, we still avoid most of the disk I/Os that the
algorithms discussed previously would use.
6.4.2 Avoiding Errors in Sampling Algorithms
We should be mindful of the problem with the simple algorithm of Section 6.4.1: it can be
relied upon neither to produce all the itemsets that are frequent in the whole dataset, nor
to produce only itemsets that are frequent in the whole. An itemset that is frequent in
the whole but not in the sample is a false negative, while an itemset that is frequent in the
sample but not the whole is a false positive.
If the sample is large enough, there are unlikely to be serious errors. That is, an itemset
whose support is much larger than the threshold will almost surely be identified from a
random sample, and an itemset whose support is much less than the threshold is unlikely
to appear frequent in the sample. However, an itemset whose support in the whole is very
close to the threshold is as likely to be frequent in the sample as not.
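To get a feel for these probabilities, the sketch below (my own illustration, not from the text) treats each basket as sampled independently with probability p, so an itemset that occurs in c baskets of the whole appears a Binomial(c, p) number of times in the sample; the values of p, s, and the supports are hypothetical.

```python
from scipy.stats import binom

p = 0.01                          # hypothetical sampling fraction
s = 10_000                        # hypothetical support threshold (baskets) for the whole
sample_threshold = int(p * s)     # = 100 occurrences required in the sample

# Probability that an itemset with support c in the whole dataset
# reaches the sample threshold, i.e. P(Binomial(c, p) >= p*s).
for c in (5_000, 9_000, 10_000, 11_000, 20_000):
    prob = binom.sf(sample_threshold - 1, c, p)
    print(f"support {c:>6} in the whole -> P(frequent in sample) ~ {prob:.2f}")
```

An itemset well above the threshold is almost certain to come out frequent in the sample, one well below it almost certainly will not, and one whose support is right at the threshold comes out frequent only about half the time.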
We can eliminate false positives by making a pass through the full dataset and counting
all the itemsets that were identified as frequent in the sample. Retain as frequent only those
itemsets that were frequent in the sample and also frequent in the whole. Note that this
improvement will eliminate all false positives, but a false negative is not counted and
therefore remains undiscovered.
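As a rough sketch of that verification pass (the function name, the basket representation, and the counting strategy here are my own, not from the text):

```python
from collections import defaultdict

def verify_candidates(baskets, candidates, s):
    """One full pass that keeps only the candidates frequent in the whole.

    `baskets` is an iterable of item collections (e.g., streamed from disk),
    `candidates` are the itemsets found frequent in the sample, and `s` is
    the support threshold for the whole dataset.
    """
    candidates = [frozenset(c) for c in candidates]
    counts = defaultdict(int)
    for basket in baskets:
        basket = set(basket)
        for itemset in candidates:
            if itemset <= basket:        # itemset is contained in this basket
                counts[itemset] += 1
    # All false positives are eliminated here; false negatives were never
    # among the candidates, so they remain undiscovered.
    return {itemset for itemset in candidates if counts[itemset] >= s}
```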
To accomplish this task in a single pass, we need to be able to count all frequent itemsets
of all sizes at once, within main memory. If we were able to run the simple algorithm
successfully with the main memory available, then there is a good chance we shall be able to
count all the frequent itemsets at once, because:
(a) The frequent singletons and pairs are likely to dominate the collection of all frequent
itemsets, and we had to count them all in one pass already.
(b) We now have all of main memory available, since we do not have to store the sample
in main memory.
We cannot eliminate false negatives completely, but we can reduce their number if the
amount of main memory allows it. We have assumed that if s is the support threshold, and
the sample is fraction p of the entire dataset, then we use ps as the support threshold for the
sample. However, we can use something smaller than that as the threshold for the sample,
such as 0.9ps. Having a lower threshold means that more itemsets of each size will have to
be counted, so the main-memory requirement rises. On the other hand, if there is enough
main memory, then we shall identify as having support at least 0.9ps in the sample almost
all those itemsets that have support at least s in the whole. If we then make a complete pass
to eliminate those itemsets that were identified as frequent in the sample but are not frequent
in the whole, we have no false positives and hopefully few or no false negatives.
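Putting the pieces together, the overall scheme might look like the following sketch; `read_baskets`, `mine_in_memory`, and the parameter names are my own stand-ins (any main-memory frequent-itemset algorithm can play the role of `mine_in_memory`), `verify_candidates` is the function from the earlier sketch, and 0.9 is the safety factor discussed above.

```python
import random

def sample_then_verify(read_baskets, p, s, mine_in_memory, safety=0.9):
    """Sample at fraction p, mine the sample at the lowered threshold
    safety*p*s, then make one full pass to remove false positives."""
    # First pass over the full dataset: draw the sample.
    sample = [b for b in read_baskets() if random.random() < p]

    # Main-memory work: run the in-memory algorithm on the sample with the
    # lowered threshold, which reduces false negatives at the cost of
    # counting more candidate itemsets.
    candidates = mine_in_memory(sample, safety * p * s)

    # Second pass over the full dataset: keep only candidates whose support
    # in the whole is at least s, so no false positives remain.
    return verify_candidates(read_baskets(), candidates, s)
```

Passing `read_baskets` as a callable lets both passes stream the baskets from disk rather than holding the full dataset in main memory.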