6.6 Summary of Chapter 6
Market-Basket Data : This model of data assumes there are two kinds of entities: items and baskets. There is a
many-many relationship between items and baskets. Typically, baskets are related to small sets of items, while items
may be related to many baskets.
Frequent Itemsets : The support for a set of items is the number of baskets containing all those items. Itemsets with
support that is at least some threshold are called frequent itemsets.
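For concreteness, support can be computed directly from the baskets. The sketch below is a minimal illustration; the baskets, item names, and threshold are all invented example values.

    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]

    def support(itemset, baskets):
        # Number of baskets that contain every item of `itemset`.
        return sum(1 for b in baskets if itemset <= b)

    s = 2  # support threshold (example value)
    print(support({"bread", "milk"}, baskets))  # 2, so {bread, milk} is frequent
    print(support({"bread", "beer"}, baskets))  # 1, below the threshold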
Association Rules : These are implications that if a basket contains a certain set of items I , then it is likely to contain
another particular item j as well. The probability that j is also in a basket containing I is called the confidence of
the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that
contain j .
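Both quantities can be read off the same support counts. The following sketch, using the same kind of invented example data, estimates the confidence and interest of a rule I -> j.

    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
    ]

    def support(S):
        return sum(1 for b in baskets if S <= b)

    def confidence(I, j):
        # Fraction of the baskets containing I that also contain j.
        return support(I | {j}) / support(I)

    def interest(I, j):
        # How far the confidence deviates from the base rate of j.
        return confidence(I, j) - support({j}) / len(baskets)

    print(confidence({"diapers"}, "beer"))  # 2/3
    print(interest({"diapers"}, "beer"))    # 2/3 - 2/4 = 1/6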
The Pair-Counting Bottleneck : To find frequent itemsets, we need to examine all baskets and count the number of
occurrences of sets of a certain size. For typical data, with a goal of producing a small number of itemsets that are the
most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods
for finding frequent itemsets typically concentrate on how to minimize the main memory needed to count pairs.
Triangular Matrices : While one could use a two-dimensional array to count pairs, doing so wastes half the space,
because there is no need to count pair { i, j } in both the i - j and j - i array elements. By arranging the pairs ( i, j ) for
which i < j in lexicographic order, we can store only the needed counts in a one-dimensional array with no wasted
space, and yet be able to access the count for any pair efficiently.
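A minimal sketch of this layout, assuming items have been renumbered 0 through n-1; the index formula below is the zero-based form of the standard triangular-matrix arithmetic.

    n = 5  # number of items (example value)
    counts = [0] * (n * (n - 1) // 2)  # one slot per pair, no waste

    def index(i, j):
        # Slots for pairs whose first element is below i come first:
        # sum over m < i of (n - 1 - m) = i*(2n - i - 1)/2; then offset
        # within the block of pairs starting with i is j - i - 1.
        assert 0 <= i < j < n
        return i * (2 * n - i - 1) // 2 + (j - i - 1)

    counts[index(1, 3)] += 1  # record one occurrence of the pair {1, 3}

    # Every pair maps to a distinct slot and every slot is used:
    assert sorted(index(i, j) for i in range(n)
                  for j in range(i + 1, n)) == list(range(len(counts)))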
Storage of Pair Counts as Triples : If fewer than 1/3 of the possible pairs actually occur in baskets, then it is more
space-efficient to store counts of pairs as triples ( i, j, c ), where c is the count of the pair { i, j }, and i < j . An index
structure such as a hash table allows us to find the triple for ( i, j ) efficiently.
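A sketch of the triples representation, using a Python dict as the hash-table index; each entry plays the role of one triple (i, j, c). The baskets are invented example data.

    from itertools import combinations

    baskets = [{1, 3, 4}, {2, 3}, {1, 3}]
    counts = {}  # maps (i, j) with i < j to its count c

    for b in baskets:
        for i, j in combinations(sorted(b), 2):
            counts[(i, j)] = counts.get((i, j), 0) + 1

    print(counts)                 # {(1, 3): 2, (1, 4): 1, (3, 4): 1, (2, 3): 1}
    print(counts.get((1, 3), 0))  # constant-time lookup of the pair {1, 3}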
Monotonicity of Frequent Itemsets : An important property of itemsets is that if a set of items is frequent, then so are
all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: if
an itemset is not frequent, then neither are its supersets.
The A-Priori Algorithm for Pairs : We can find all frequent pairs by making two passes over the baskets. On the first
pass, we count the items themselves, and then determine which items are frequent. On the second pass, we count
only the pairs of items both of which are found frequent on the first pass. Monotonicity justifies our ignoring other
pairs.
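A two-pass sketch of this idea; the data and threshold are invented, and a Counter stands in for whatever main-memory count structure is actually used.

    from collections import Counter
    from itertools import combinations

    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
               {"b", "c"}, {"a", "b", "c"}]
    s = 3  # support threshold (example value)

    # Pass 1: count single items; keep the frequent ones.
    item_counts = Counter(i for b in baskets for i in b)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose members are both frequent items.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b & frequent_items), 2):
            pair_counts[pair] += 1

    frequent_pairs = {p for p, c in pair_counts.items() if c >= s}
    print(frequent_pairs)  # {('a','b'), ('a','c'), ('b','c')}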
Finding Larger Frequent Itemsets : A-Priori and many other algorithms allow us to find frequent itemsets larger than
pairs, if we make one pass over the baskets for each size itemset, up to some limit. To find the frequent itemsets of
size k , monotonicity lets us restrict our attention to only those itemsets such that all their subsets of size k − 1 have
already been found frequent.
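A brute-force sketch of this monotonicity filter: it enumerates every size-k subset of the items, which only makes sense for tiny examples (practical implementations instead build candidates by joining the frequent (k-1)-sets).

    from itertools import combinations

    def candidates(frequent_smaller, k):
        # frequent_smaller: the frequent itemsets of size k-1, as frozensets.
        items = set().union(*frequent_smaller)
        cands = set()
        for c in map(frozenset, combinations(sorted(items), k)):
            # Keep c only if all of its (k-1)-subsets are already frequent.
            if all(frozenset(sub) in frequent_smaller
                   for sub in combinations(c, k - 1)):
                cands.add(c)
        return cands

    L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
    print(candidates(L2, 3))  # only {a, b, c}; e.g. {b, c, d} fails on {c, d}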
The PCY Algorithm : This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-
memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used
as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to
count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on
the first pass.
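A sketch of both PCY passes; the number of buckets, the hash function, the data, and the threshold are all illustrative choices, not recommendations.

    from collections import Counter
    from itertools import combinations

    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
               {"b", "c"}, {"a", "b", "c"}]
    s = 3                # support threshold (example value)
    NUM_BUCKETS = 7      # in practice: as many as spare memory allows

    def bucket(pair):
        return hash(pair) % NUM_BUCKETS

    # Pass 1: count items AND hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(sorted(b), 2):
            bucket_counts[bucket(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # A bitmap of frequent buckets replaces the bucket counts for pass 2.
    frequent_bucket = [c >= s for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent AND its bucket is.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b & frequent_items), 2):
            if frequent_bucket[bucket(pair)]:
                pair_counts[pair] += 1

    print({p for p, c in pair_counts.items() if c >= s})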
The Multistage Algorithm : We can insert additional passes between the first and second pass of the PCY Algorithm
to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent
items that have hashed to frequent buckets on all previous passes.
The Multihash Algorithm : We can modify the first pass of the PCY Algorithm to divide available main memory into
several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent
buckets in all hash tables.
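A sketch of the Multihash first pass with two tables; the table sizes and the cheap per-table hash functions are illustrative only.

    from itertools import combinations

    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
               {"b", "c"}, {"a", "b", "c"}]
    s = 3
    tables = [[0] * 5, [0] * 7]  # two hash tables splitting the same memory

    def h(t, pair):
        # Two cheap, distinct hash functions (illustrative, not recommended).
        key = sum(ord(ch) for item in pair for ch in item)
        return (key * (31 if t == 0 else 37)) % len(tables[t])

    # Pass 1: hash every pair into every table.
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            for t in range(len(tables)):
                tables[t][h(t, pair)] += 1

    # On pass 2, a pair of frequent items is counted only if it hashed
    # to a frequent bucket in *every* table:
    def survives(pair):
        return all(tables[t][h(t, pair)] >= s for t in range(len(tables)))

    print(survives(("a", "b")))  # True: both of its buckets reached s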
Randomized Algorithms : Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion to the size of the sample. We can then find the frequent itemsets for the sample, and hope that the sample is representative of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets frequent in the sample but not the whole) and false negatives (itemsets frequent in the whole but not the sample).
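A sketch of sampling with a proportionally scaled threshold, shown for single items to keep it short; the sampling rate, seed, and data are invented example values.

    import random
    from collections import Counter

    def frequent_items_from_sample(baskets, s, rate, seed=0):
        rng = random.Random(seed)
        sample = [b for b in baskets if rng.random() < rate]
        scaled_s = s * rate  # the threshold shrinks with the sample
        counts = Counter(i for b in sample for i in b)
        # Frequent in the sample; may include false positives and
        # false negatives relative to the full data.
        return {i for i, c in counts.items() if c >= scaled_s}

    baskets = [{"a", "b"}] * 900 + [{"c"}] * 100
    print(frequent_items_from_sample(baskets, s=200, rate=0.1))
    # Likely {'a', 'b'}; 'c' rarely clears the scaled threshold of 20.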