Databases Reference
In-Depth Information
probability that it will fail to produce any answer at all. In that case it needs
to be repeated until it gives an answer. However, the average number of passes
needed before it produces all and only the frequent itemsets is a small constant.
Toivonen's algorithm begins by selecting a small sample of the input dataset,
and finding from it the candidate frequent itemsets. The process is exactly that
of Section 6.4.1, except that it is essential the threshold be set to something
less than its proportional value. That is, if the support threshold for the whole
dataset is s, and the sample size is fraction p, then when looking for frequent
itemsets in the sample, use a threshold such as 0.9ps or 0.8ps. The smaller we
make the threshold, the more main memory we need for computing all itemsets
that are frequent in the sample, but the more likely we are to avoid the situation
where the algorithm fails to provide an answer.
Having constructed the collection of frequent itemsets for the sample, we
next construct the negative border. This is the collection of itemsets that are not
frequent in the sample, but all of their immediate subsets (subsets constructed
by deleting exactly one item) are frequent in the sample.
Example 6.11 : Suppose the items are{A, B, C, D, E}and we have found the
following itemsets to be frequent in the sample:{A},{B},{C},{D},{B, C},
{C, D}. Note that∅is also frequent, as long as there are at least as many
baskets as the support threshold, although technically the algorithms we have
described omit this obvious fact. First,{E}is in the negative border, because
it is not frequent in the sample, but its only immediate subset,∅, is frequent.
The sets{A, B},{A, C},{A, D}and{B, D}are in the negative border.
None of these sets are frequent, and each has two immediate subsets, both of
which are frequent. For instance,{A, B}has immediate subsets{A}and{B}.
Of the other six doubletons, none are in the negative border. The sets{B, C}
and{C, D}are not in the negative border, because they are frequent. The
remaining four pairs are each E together with another item, and those are not
in the negative border because they have an immediate subset{E}that is not
frequent.
None of the triples or larger sets are in the negative border. For instance,
{B, C, D}is not in the negative border because it has an immediate subset
{B, D}that is not frequent. Thus, the negative border consists of five sets:
{E},{A, B},{A, C},{A, D}and{B, D}.
2
To complete Toivonen's algorithm, we make a pass through the entire data-
set, counting all the itemsets that are frequent in the sample or are in the
negative border. There are two possible outcomes.
1. No member of the negative border is frequent in the whole dataset. In
this case, the correct set of frequent itemsets is exactly those itemsets
from the sample that were found to be frequent in the whole.
2. Some member of the negative border is frequent in the whole. Then we
cannot be sure that there are not some even larger sets, in neither the
Search WWH ::




Custom Search