✦ The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass. A code sketch appears after this list.
✦ The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes. (Sketched below, after the list.)
✦ The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if it hashed to frequent buckets in all hash tables. (See the sketch following the list.)
✦ Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion to the sample size. We can then find the frequent itemsets for the sample and hope that it is a good representation of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample). A sampling sketch follows the list.
✦ The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate in a map-reduce setting. (A two-pass sketch appears after the list.)
✦ Toivonen's Algorithm: This algorithm starts by finding frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also the negative border: itemsets that have not been found frequent, but all of whose immediate subsets are. If no member of the negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample. (The last sketch after the list illustrates this loop.)
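The following is a minimal, in-memory Python sketch of PCY restricted to pairs. It is illustrative only: the point of the bucket array is that it fits in main memory when the pair counts would not, and the bucket count, hash choice, and function names here are assumptions rather than a reference implementation.

    from itertools import combinations
    from collections import defaultdict

    def pcy_frequent_pairs(baskets, support, num_buckets=10007):
        # Pass 1: count singletons, and hash every pair to a bucket counter.
        item_counts = defaultdict(int)
        buckets = [0] * num_buckets
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(sorted(basket), 2):
                buckets[hash(pair) % num_buckets] += 1
        frequent_items = {i for i, c in item_counts.items() if c >= support}
        frequent_bucket = [c >= support for c in buckets]

        # Pass 2: count a pair only if both items are frequent and the
        # pair hashed to a frequent bucket on the first pass.
        pair_counts = defaultdict(int)
        for basket in baskets:
            items = sorted(i for i in basket if i in frequent_items)
            for pair in combinations(items, 2):
                if frequent_bucket[hash(pair) % num_buckets]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= support}

    baskets = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "beer"}]
    print(pcy_frequent_pairs(baskets, support=2))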
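Multistage adds an intermediate pass with a second, independent hash table. The sketch below follows the same conventions as the PCY sketch above; salting one hash function with "t1" and the other with "t2" is an assumed way of keeping the two tables independent.

    from itertools import combinations
    from collections import defaultdict

    def multistage_pairs(baskets, support, n1=10007, n2=9973):
        h1 = lambda p: hash(("t1", p)) % n1  # two independent hash functions
        h2 = lambda p: hash(("t2", p)) % n2

        # Pass 1: singleton counts plus the first table of bucket counts.
        item_counts = defaultdict(int)
        t1 = [0] * n1
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(sorted(basket), 2):
                t1[h1(pair)] += 1
        freq_item = {i for i, c in item_counts.items() if c >= support}
        f1 = [c >= support for c in t1]

        # Pass 2 (the inserted pass): rehash only pairs of frequent items
        # whose first-table bucket was frequent.
        t2 = [0] * n2
        for basket in baskets:
            items = sorted(i for i in basket if i in freq_item)
            for pair in combinations(items, 2):
                if f1[h1(pair)]:
                    t2[h2(pair)] += 1
        f2 = [c >= support for c in t2]

        # Final pass: count pairs that survived both bucket filters.
        pair_counts = defaultdict(int)
        for basket in baskets:
            items = sorted(i for i in basket if i in freq_item)
            for pair in combinations(items, 2):
                if f1[h1(pair)] and f2[h2(pair)]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= support}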
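Multihash instead splits the first-pass memory between several tables filled at the same time; a pair becomes a candidate only if all of its buckets are frequent. A sketch with two tables, under the same assumptions as above:

    from itertools import combinations
    from collections import defaultdict

    def multihash_pairs(baskets, support, n=5003):
        # Split the memory PCY would use for one table into two smaller ones.
        h1 = lambda p: hash(("a", p)) % n
        h2 = lambda p: hash(("b", p)) % n
        item_counts = defaultdict(int)
        t1, t2 = [0] * n, [0] * n
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(sorted(basket), 2):
                t1[h1(pair)] += 1
                t2[h2(pair)] += 1
        freq_item = {i for i, c in item_counts.items() if c >= support}
        f1 = [c >= support for c in t1]
        f2 = [c >= support for c in t2]

        # Pass 2: a pair counts only if all of its buckets were frequent.
        pair_counts = defaultdict(int)
        for basket in baskets:
            items = sorted(i for i in basket if i in freq_item)
            for pair in combinations(items, 2):
                if f1[h1(pair)] and f2[h2(pair)]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= support}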
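A sampling sketch, again restricted to pairs and assuming the sample fits in memory; a real implementation would sample while streaming the basket file rather than holding it as a list.

    import random
    from itertools import combinations
    from collections import defaultdict

    def sample_frequent_pairs(baskets, support, fraction=0.1, seed=42):
        rng = random.Random(seed)
        sample = [b for b in baskets if rng.random() < fraction]
        scaled = max(1, int(support * fraction))  # threshold scaled to sample
        counts = defaultdict(int)
        for basket in sample:
            for pair in combinations(sorted(basket), 2):
                counts[pair] += 1
        # Frequent in the sample only: may include false positives and
        # silently drop false negatives relative to the full dataset.
        return {p for p, c in counts.items() if c >= scaled}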
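A two-pass SON sketch for pairs. Slicing a Python list into interleaved segments stands in for distributing segments to map tasks; the proportionally lowered per-segment threshold guarantees no false negatives among the candidates, since a pair frequent in the whole must be frequent in at least one segment.

    from itertools import combinations
    from collections import defaultdict

    def frequent_pairs(baskets, threshold):
        counts = defaultdict(int)
        for basket in baskets:
            for pair in combinations(sorted(basket), 2):
                counts[pair] += 1
        return {p for p, c in counts.items() if c >= threshold}

    def son_pairs(baskets, support, num_segments=4):
        segments = [baskets[i::num_segments] for i in range(num_segments)]
        # "Map" phase: find pairs frequent in some segment at the
        # proportionally lowered threshold; their union is the candidates.
        candidates = set()
        for segment in segments:
            candidates |= frequent_pairs(segment, max(1, support // num_segments))
        # Second ("reduce") pass: exact counts, but only for candidates.
        counts = defaultdict(int)
        for basket in baskets:
            for pair in combinations(sorted(basket), 2):
                if pair in candidates:
                    counts[pair] += 1
        return {p: c for p, c in counts.items() if c >= support}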
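Finally, a Toivonen sketch restricted to pairs, so that the negative border is just the pairs of sample-frequent items that were not themselves frequent in the sample; in general the border contains itemsets of all sizes. The lowering factor lower is an assumed tuning knob, not a prescribed value.

    import random
    from itertools import combinations
    from collections import defaultdict

    def toivonen_pairs(baskets, support, fraction=0.2, lower=0.8, seed=0):
        rng = random.Random(seed)
        while True:
            sample = [b for b in baskets if rng.random() < fraction]
            # Scale the threshold to the sample, then lower it further so a
            # truly frequent itemset is unlikely to be missed.
            threshold = max(1, int(lower * support * fraction))
            item_counts = defaultdict(int)
            pair_counts = defaultdict(int)
            for basket in sample:
                for item in basket:
                    item_counts[item] += 1
                for pair in combinations(sorted(basket), 2):
                    pair_counts[pair] += 1
            freq_items = {i for i, c in item_counts.items() if c >= threshold}
            freq_pairs = {p for p, c in pair_counts.items() if c >= threshold}
            # Negative border (pairs only): both singletons frequent in the
            # sample, but the pair itself is not.
            border = {p for p in combinations(sorted(freq_items), 2)
                      if p not in freq_pairs}
            # Full pass: count only the sample-frequent pairs and the border.
            candidates = freq_pairs | border
            full = defaultdict(int)
            for basket in baskets:
                for pair in combinations(sorted(basket), 2):
                    if pair in candidates:
                        full[pair] += 1
            if all(full[p] < support for p in border):
                return {p for p in freq_pairs if full[p] >= support}
            # Some border pair was frequent in the whole: resample and retry.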