Database Reference
In-Depth Information
than the total number of items. The number of baskets is usually assumed to be very large,
bigger than what can fit in main memory. The data is assumed to be represented in a file
consisting of a sequence of baskets. In terms of the distributed file system described in Sec-
tion 2.1 , the baskets are the objects of the file, and each basket is of type “set of items.”
6.1.1
Definition of Frequent Itemsets
Intuitively, a set of items that appears in many baskets is said to be “frequent.” To be form-
al, we assume there is a number s , called the support threshold . If I is a set of items, the
support for I is the number of baskets for which I is a subset. We say I is frequent if its
support is s or more.
EXAMPLE 6.1 In Fig. 6.1 are sets of words. Each set is a basket, and the words are items.
We took these sets by Googling cat dog and taking snippets from the highest-ranked
pages. Do not be concerned if a word appears twice in a basket, as baskets are sets, and in
principle items can appear only once. Also, ignore capitalization.
Figure 6.1 Here are eight baskets, each consisting of items that are words
Since the empty set is a subset of any set, the support for is 8. However, we shall not
generally concern ourselves with the empty set, since it tells us nothing.
Among the singleton sets, obviously {cat} and {dog} are quite frequent. “Dog” appears
in all but basket (5), so its support is 7, while “cat” appears in all but (4) and (8), so its
support is 6. The word “and” is also quite frequent; it appears in (1), (2), (5), (7), and (8),
so its support is 5. The words “a” and “training” appear in three sets, while “for” and “is”
appear in two each. No other word appears more than once.
Suppose that we set our threshold at s = 3. Then there are five frequent singleton item-
sets: {dog}, {cat}, {and}, {a}, and {training}.
Now, let us look at the doubletons. A doubleton cannot be frequent unless both items in
the set are frequent by themselves. Thus, there are only ten possible frequent doubletons.
Fig. 6.2 is a table indicating which baskets contain which doubletons.
Search WWH ::




Custom Search