Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

probability that it will fail to produce any answer at all. In that case it needs

to be repeated until it gives an answer. However, the average number of passes

needed before it produces all and only the frequent itemsets is a small constant.

Toivonen's algorithm begins by selecting a small sample of the input dataset,

and finding from it the candidate frequent itemsets. The process is exactly that

of Section 6.4.1, except that it is essential the threshold be set to something

less than its proportional value. That is, if the support threshold for the whole

dataset is s, and the sample size is fraction p, then when looking for frequent

itemsets in the sample, use a threshold such as 0.9ps or 0.8ps. The smaller we

make the threshold, the more main memory we need for computing all itemsets

that are frequent in the sample, but the more likely we are to avoid the situation

where the algorithm fails to provide an answer.

Having constructed the collection of frequent itemsets for the sample, we

next construct the negative border. This is the collection of itemsets that are not

frequent in the sample, but all of their immediate subsets (subsets constructed

by deleting exactly one item) are frequent in the sample.

Example 6.11 : Suppose the items are{A, B, C, D, E}and we have found the

following itemsets to be frequent in the sample:{A},{B},{C},{D},{B, C},

{C, D}. Note that∅is also frequent, as long as there are at least as many

baskets as the support threshold, although technically the algorithms we have

described omit this obvious fact. First,{E}is in the negative border, because

it is not frequent in the sample, but its only immediate subset,∅, is frequent.

The sets{A, B},{A, C},{A, D}and{B, D}are in the negative border.

None of these sets are frequent, and each has two immediate subsets, both of

which are frequent. For instance,{A, B}has immediate subsets{A}and{B}.

Of the other six doubletons, none are in the negative border. The sets{B, C}

and{C, D}are not in the negative border, because they are frequent. The

remaining four pairs are each E together with another item, and those are not

in the negative border because they have an immediate subset{E}that is not

frequent.

None of the triples or larger sets are in the negative border. For instance,

{B, C, D}is not in the negative border because it has an immediate subset

{B, D}that is not frequent. Thus, the negative border consists of five sets:

{E},{A, B},{A, C},{A, D}and{B, D}.

2

To complete Toivonen's algorithm, we make a pass through the entire data-

set, counting all the itemsets that are frequent in the sample or are in the

negative border. There are two possible outcomes.

1. No member of the negative border is frequent in the whole dataset. In

this case, the correct set of frequent itemsets is exactly those itemsets

from the sample that were found to be frequent in the whole.

2. Some member of the negative border is frequent in the whole. Then we

cannot be sure that there are not some even larger sets, in neither the

Mining of Massive Datasets

Search WWH ::

Custom Search

Home