Frequent Itemsets - Mining of Massive Datasets

Databases Reference

In-Depth Information

Chapter 6

Frequent Itemsets

We turn in this chapter to one of the major families of techniques for character-

izing data: the discovery of frequent itemsets. This problem is often viewed as

the discovery of “association rules,” although the latter is a more complex char-

acterization of data, whose discovery depends fundamentally on the discovery

of frequent itemsets.

To begin, we introduce the “market-basket” model of data, which is essen-

tially a many-many relationship between two kinds of elements, called “items”

and “baskets,” but with some assumptions about the shape of the data. The

frequent-itemsets problem is that of finding sets of items that appear in (are

related to) many of the same baskets.

The problem of finding frequent itemsets differs from the similarity search

discussed in Chapter 3. Here we are interested in the absolute number of baskets

that contain a particular set of items. In Chapter 3 we wanted items that have

a large fraction of their baskets in common, even if the absolute number of

baskets is small.

The difference leads to a new class of algorithms for finding frequent item-

sets. We begin with the A-Priori Algorithm, which works by eliminating most

large sets as candidates by looking first at smaller sets and recognizing that a

large set cannot be frequent unless all its subsets are. We then consider various

improvements to the basic A-Priori idea, concentrating on very large data sets

that stress the available main memory.

Next, we consider approximate algorithms that work faster but are not

guaranteed to find all frequent itemsets. Also in this class of algorithms are

those that exploit parallelism, including the parallelism we can obtain through

a map-reduce formulation.

Finally, we discuss briefly how to find frequent

itemsets in a data stream.

183

Search WWH ::

Custom Search

Home