Database Reference
In-Depth Information
6
Frequent Itemsets
We turn in this chapter to one of the major families of techniques for characterizing data:
the discovery of frequent itemsets. This problem is often viewed as the discovery of “associ-
ation rules,” although the latter is a more complex characterization of data, whose discovery
depends fundamentally on the discovery of frequent itemsets.
To begin, we introduce the “market-basket” model of data, which is essentially a many-
many relationship between two kinds of elements, called “items” and “baskets,” but with
some assumptions about the shape of the data. The frequentitemsets problem is that of find-
ing sets of items that appear in (are related to) many of the same baskets.
The problem of finding frequent itemsets differs from the similarity search discussed in
Chapter 3 . Here we are interested in the absolute number of baskets that contain a particular
set of items. In Chapter 3 we wanted items that have a large fraction of their baskets in com-
mon, even if the absolute number of baskets is small.
The difference leads to a new class of algorithms for finding frequent itemsets. We begin
with the A-Priori Algorithm, which works by eliminating most large sets as candidates by
looking first at smaller sets and recognizing that a large set cannot be frequent unless all its
subsets are. We then consider various improvements to the basic A-Priori idea, concentrat-
ing on very large data sets that stress the available main memory.
Next, we consider approximate algorithms that work faster but are not guaranteed to find
all frequent itemsets. Also in this class of algorithms are those that exploit parallelism, in-
cluding the parallelism we can obtain through a MapReduce formulation. Finally, we discuss
briefly how to find frequent itemsets in a data stream.
6.1 The Market-Basket Model
The market-basket model of data is used to describe a common form of many-many rela-
tionship between two kinds of objects. On the one hand, we have items , and on the other
we have baskets , sometimes called “transactions.” Each basket consists of a set of items (an
itemset ), and usually we assume that the number of items in a basket is small - much smaller
Search WWH ::




Custom Search