more main memory, then an option is to read the sample from disk for each pass. Since the
sample is much smaller than the full dataset, we still avoid most of the disk I/Os that the
algorithms discussed previously would use.
6.4.2 Avoiding Errors in Sampling Algorithms
We should be mindful of the problem with the simple algorithm of Section 6.4.1: it can be
relied upon neither to produce all the itemsets that are frequent in the whole dataset, nor
to produce only itemsets that are frequent in the whole. An itemset that is frequent in
the whole but not in the sample is a false negative, while an itemset that is frequent in the
sample but not the whole is a false positive.
If the sample is large enough, there are unlikely to be serious errors. That is, an itemset
whose support is much larger than the threshold will almost surely be identified from a
random sample, and an itemset whose support is much less than the threshold is unlikely
to appear frequent in the sample. However, an itemset whose support in the whole is very
close to the threshold is as likely to be frequent in the sample as not.
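To get a feel for these probabilities, the sketch below (my own illustration, not from the text) treats each basket as sampled independently with probability p, so an itemset that occurs in c baskets of the whole appears a Binomial(c, p) number of times in the sample; the values of p, s, and the supports are hypothetical.

```python
from scipy.stats import binom

p = 0.01                          # hypothetical sampling fraction
s = 10_000                        # hypothetical support threshold (baskets) for the whole
sample_threshold = int(p * s)     # = 100 occurrences required in the sample

# Probability that an itemset with support c in the whole dataset
# reaches the sample threshold, i.e. P(Binomial(c, p) >= p*s).
for c in (5_000, 9_000, 10_000, 11_000, 20_000):
    prob = binom.sf(sample_threshold - 1, c, p)
    print(f"support {c:>6} in the whole -> P(frequent in sample) ~ {prob:.2f}")
```

An itemset well above the threshold is almost certain to come out frequent in the sample, one well below it almost certainly will not, and one whose support is right at the threshold comes out frequent only about half the time.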
We can eliminate false positives by making a pass through the full dataset and counting
all the itemsets that were identified as frequent in the sample. Retain as frequent only those
itemsets that were frequent in the sample and also frequent in the whole. Note that this
improvement will eliminate all false positives, but a false negative is not counted and
therefore remains undiscovered.
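As a rough sketch of that verification pass (the function name, the basket representation, and the counting strategy here are my own, not from the text):

```python
from collections import defaultdict

def verify_candidates(baskets, candidates, s):
    """One full pass that keeps only the candidates frequent in the whole.

    `baskets` is an iterable of item collections (e.g., streamed from disk),
    `candidates` are the itemsets found frequent in the sample, and `s` is
    the support threshold for the whole dataset.
    """
    candidates = [frozenset(c) for c in candidates]
    counts = defaultdict(int)
    for basket in baskets:
        basket = set(basket)
        for itemset in candidates:
            if itemset <= basket:        # itemset is contained in this basket
                counts[itemset] += 1
    # All false positives are eliminated here; false negatives were never
    # among the candidates, so they remain undiscovered.
    return {itemset for itemset in candidates if counts[itemset] >= s}
```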
To accomplish this task in a single pass, we need to be able to count all frequent itemsets
of all sizes at once, within main memory. If we were able to run the simple algorithm
successfully with the main memory available, then there is a good chance we shall be able to
count all the frequent itemsets at once, because:
(a) The frequent singletons and pairs are likely to dominate the collection of all frequent
itemsets, and we had to count them all in one pass already.
(b) We now have all of main memory available, since we do not have to store the sample
in main memory.
We cannot eliminate false negatives completely, but we can reduce their number if the
amount of main memory allows it. We have assumed that if s is the support threshold, and
the sample is fraction p of the entire dataset, then we use ps as the support threshold for the
sample. However, we can use something smaller than that as the threshold for the sample,
such as 0.9ps. Having a lower threshold means that more itemsets of each size will have to
be counted, so the main-memory requirement rises. On the other hand, if there is enough
main memory, then we shall identify as having support at least 0.9ps in the sample almost
all those itemsets that have support at least s in the whole. If we then make a complete pass
to eliminate those itemsets that were identified as frequent in the sample but are not frequent
in the whole, we have no false positives and hopefully few or no false negatives.
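Putting the pieces together, the overall scheme might look like the following sketch; `read_baskets`, `mine_in_memory`, and the parameter names are my own stand-ins (any main-memory frequent-itemset algorithm can play the role of `mine_in_memory`), `verify_candidates` is the function from the earlier sketch, and 0.9 is the safety factor discussed above.

```python
import random

def sample_then_verify(read_baskets, p, s, mine_in_memory, safety=0.9):
    """Sample at fraction p, mine the sample at the lowered threshold
    safety*p*s, then make one full pass to remove false positives."""
    # First pass over the full dataset: draw the sample.
    sample = [b for b in read_baskets() if random.random() < p]

    # Main-memory work: run the in-memory algorithm on the sample with the
    # lowered threshold, which reduces false negatives at the cost of
    # counting more candidate itemsets.
    candidates = mine_in_memory(sample, safety * p * s)

    # Second pass over the full dataset: keep only candidates whose support
    # in the whole is at least s, so no false positives remain.
    return verify_candidates(read_baskets(), candidates, s)
```

Passing `read_baskets` as a callable lets both passes stream the baskets from disk rather than holding the full dataset in main memory.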