Databases Reference
In-Depth Information
Chapter 6
Frequent Itemsets
We turn in this chapter to one of the major families of techniques for character-
izing data: the discovery of frequent itemsets. This problem is often viewed as
the discovery of “association rules,” although the latter is a more complex char-
acterization of data, whose discovery depends fundamentally on the discovery
of frequent itemsets.
To begin, we introduce the “market-basket” model of data, which is essen-
tially a many-many relationship between two kinds of elements, called “items”
and “baskets,” but with some assumptions about the shape of the data. The
frequent-itemsets problem is that of finding sets of items that appear in (are
related to) many of the same baskets.
The problem of finding frequent itemsets differs from the similarity search
discussed in Chapter 3. Here we are interested in the absolute number of baskets
that contain a particular set of items. In Chapter 3 we wanted items that have
a large fraction of their baskets in common, even if the absolute number of
baskets is small.
The difference leads to a new class of algorithms for finding frequent item-
sets. We begin with the A-Priori Algorithm, which works by eliminating most
large sets as candidates by looking first at smaller sets and recognizing that a
large set cannot be frequent unless all its subsets are. We then consider various
improvements to the basic A-Priori idea, concentrating on very large data sets
that stress the available main memory.
Next, we consider approximate algorithms that work faster but are not
guaranteed to find all frequent itemsets. Also in this class of algorithms are
those that exploit parallelism, including the parallelism we can obtain through
a map-reduce formulation.
Finally, we discuss briefly how to find frequent
itemsets in a data stream.
183
Search WWH ::




Custom Search