Database Reference
In-Depth Information
that it never produces a false negative, we must show that when no member of the negative
border is frequent in the whole, then there can be no itemset whatsoever that is:
(1) Frequent in the whole, but
(2) In neither the negative border nor the collection of frequent itemsets for the sample.
Suppose the contrary. That is, there is a set S that is frequent in the whole, but not in the
negative border and not frequent in the sample. Also, this round of Toivonen's Algorithm
produced an answer, which would certainly not include S among the frequent itemsets. By
monotonicity, all subsets of S are also frequent in the whole. Let T be a subset of S that is
of the smallest possible size among all subsets of S that are not frequent in the sample.
We claim that T must be in the negative border. Surely T meets one of the conditions
for being in the negative border: it is not frequent in the sample. It also meets the other
condition for being in the negative border: each of its immediate subsets is frequent in the
sample. For if some immediate subset of T were not frequent in the sample, then there
would be a subset of S that is smaller than T and not frequent in the sample, contradicting
our selection of T as a subset of S that was not frequent in the sample, yet as small as any
such set.
Now we see that T is both in the negative border and frequent in the whole dataset. Con-
sequently, this round of Toivonen's algorithm did not produce an answer.
6.4.7
Exercises for Section 6.4
EXERCISE 6.4.1 Suppose there are eight items, A, B, . . . , H , and the following are the max-
imal frequent itemsets: { A, B }, { B, C }, { A, C }, { A, D }, { E }, and { F }. Find the negative
border.
EXERCISE 6.4.2 Apply Toivonen's Algorithm to the data of Exercise 6.3.1 , with a support
threshold of 4. Take as the sample the first row of baskets: {1 , 2 , 3}, {2 , 3 , 4}, {3 , 4 , 5},
and {4 , 5 , 6}, i.e., one-third of the file. Our scaled-down support theshold will be 1.
(a) What are the itemsets frequent in the sample?
(b) What is the negative border?
(c) What is the outcome of the pass through the full dataset? Are any of the itemsets in the
negative border frequent in the whole?
!! EXERCISE 6.4.3 Suppose item i appears exactly s times in a file of n baskets, where s is
the support threshold. If we take a sample of n /100 baskets, and lower the support threshold
Search WWH ::




Custom Search