✦ The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass. A code sketch appears after this list.
✦ The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes. (Sketched below, after the list.)
✦ The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if it hashed to frequent buckets in all hash tables. (See the sketch following the list.)
✦ Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion to the sample size. We can then find the frequent itemsets for the sample and hope that it is a good representation of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample). A sampling sketch follows the list.
✦ The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate in a map-reduce setting. (A two-pass sketch appears after the list.)
✦ Toivonen's Algorithm: This algorithm starts by finding frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also the negative border: itemsets that have not been found frequent, but all of whose immediate subsets are. If no member of the negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample. (The last sketch after the list illustrates this loop.)
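The following is a minimal, in-memory Python sketch of PCY restricted to pairs. It is illustrative only: the point of the bucket array is that it fits in main memory when the pair counts would not, and the bucket count, hash choice, and function names here are assumptions rather than a reference implementation.

    from itertools import combinations
    from collections import defaultdict

    def pcy_frequent_pairs(baskets, support, num_buckets=10007):
        # Pass 1: count singletons, and hash every pair to a bucket counter.
        item_counts = defaultdict(int)
        buckets = [0] * num_buckets
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(sorted(basket), 2):
                buckets[hash(pair) % num_buckets] += 1
        frequent_items = {i for i, c in item_counts.items() if c >= support}
        frequent_bucket = [c >= support for c in buckets]

        # Pass 2: count a pair only if both items are frequent and the
        # pair hashed to a frequent bucket on the first pass.
        pair_counts = defaultdict(int)
        for basket in baskets:
            items = sorted(i for i in basket if i in frequent_items)
            for pair in combinations(items, 2):
                if frequent_bucket[hash(pair) % num_buckets]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= support}

    baskets = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "beer"}]
    print(pcy_frequent_pairs(baskets, support=2))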
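Multistage adds an intermediate pass with a second, independent hash table. The sketch below follows the same conventions as the PCY sketch above; salting one hash function with "t1" and the other with "t2" is an assumed way of keeping the two tables independent.

    from itertools import combinations
    from collections import defaultdict

    def multistage_pairs(baskets, support, n1=10007, n2=9973):
        h1 = lambda p: hash(("t1", p)) % n1  # two independent hash functions
        h2 = lambda p: hash(("t2", p)) % n2

        # Pass 1: singleton counts plus the first table of bucket counts.
        item_counts = defaultdict(int)
        t1 = [0] * n1
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(sorted(basket), 2):
                t1[h1(pair)] += 1
        freq_item = {i for i, c in item_counts.items() if c >= support}
        f1 = [c >= support for c in t1]

        # Pass 2 (the inserted pass): rehash only pairs of frequent items
        # whose first-table bucket was frequent.
        t2 = [0] * n2
        for basket in baskets:
            items = sorted(i for i in basket if i in freq_item)
            for pair in combinations(items, 2):
                if f1[h1(pair)]:
                    t2[h2(pair)] += 1
        f2 = [c >= support for c in t2]

        # Final pass: count pairs that survived both bucket filters.
        pair_counts = defaultdict(int)
        for basket in baskets:
            items = sorted(i for i in basket if i in freq_item)
            for pair in combinations(items, 2):
                if f1[h1(pair)] and f2[h2(pair)]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= support}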
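Multihash instead splits the first-pass memory between several tables filled at the same time; a pair becomes a candidate only if all of its buckets are frequent. A sketch with two tables, under the same assumptions as above:

    from itertools import combinations
    from collections import defaultdict

    def multihash_pairs(baskets, support, n=5003):
        # Split the memory PCY would use for one table into two smaller ones.
        h1 = lambda p: hash(("a", p)) % n
        h2 = lambda p: hash(("b", p)) % n
        item_counts = defaultdict(int)
        t1, t2 = [0] * n, [0] * n
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            for pair in combinations(sorted(basket), 2):
                t1[h1(pair)] += 1
                t2[h2(pair)] += 1
        freq_item = {i for i, c in item_counts.items() if c >= support}
        f1 = [c >= support for c in t1]
        f2 = [c >= support for c in t2]

        # Pass 2: a pair counts only if all of its buckets were frequent.
        pair_counts = defaultdict(int)
        for basket in baskets:
            items = sorted(i for i in basket if i in freq_item)
            for pair in combinations(items, 2):
                if f1[h1(pair)] and f2[h2(pair)]:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= support}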
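A sampling sketch, again restricted to pairs and assuming the sample fits in memory; a real implementation would sample while streaming the basket file rather than holding it as a list.

    import random
    from itertools import combinations
    from collections import defaultdict

    def sample_frequent_pairs(baskets, support, fraction=0.1, seed=42):
        rng = random.Random(seed)
        sample = [b for b in baskets if rng.random() < fraction]
        scaled = max(1, int(support * fraction))  # threshold scaled to sample
        counts = defaultdict(int)
        for basket in sample:
            for pair in combinations(sorted(basket), 2):
                counts[pair] += 1
        # Frequent in the sample only: may include false positives and
        # silently drop false negatives relative to the full dataset.
        return {p for p, c in counts.items() if c >= scaled}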
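A two-pass SON sketch for pairs. Slicing a Python list into interleaved segments stands in for distributing segments to map tasks; the proportionally lowered per-segment threshold guarantees no false negatives among the candidates, since a pair frequent in the whole must be frequent in at least one segment.

    from itertools import combinations
    from collections import defaultdict

    def frequent_pairs(baskets, threshold):
        counts = defaultdict(int)
        for basket in baskets:
            for pair in combinations(sorted(basket), 2):
                counts[pair] += 1
        return {p for p, c in counts.items() if c >= threshold}

    def son_pairs(baskets, support, num_segments=4):
        segments = [baskets[i::num_segments] for i in range(num_segments)]
        # "Map" phase: find pairs frequent in some segment at the
        # proportionally lowered threshold; their union is the candidates.
        candidates = set()
        for segment in segments:
            candidates |= frequent_pairs(segment, max(1, support // num_segments))
        # Second ("reduce") pass: exact counts, but only for candidates.
        counts = defaultdict(int)
        for basket in baskets:
            for pair in combinations(sorted(basket), 2):
                if pair in candidates:
                    counts[pair] += 1
        return {p: c for p, c in counts.items() if c >= support}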
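Finally, a Toivonen sketch restricted to pairs, so that the negative border is just the pairs of sample-frequent items that were not themselves frequent in the sample; in general the border contains itemsets of all sizes. The lowering factor lower is an assumed tuning knob, not a prescribed value.

    import random
    from itertools import combinations
    from collections import defaultdict

    def toivonen_pairs(baskets, support, fraction=0.2, lower=0.8, seed=0):
        rng = random.Random(seed)
        while True:
            sample = [b for b in baskets if rng.random() < fraction]
            # Scale the threshold to the sample, then lower it further so a
            # truly frequent itemset is unlikely to be missed.
            threshold = max(1, int(lower * support * fraction))
            item_counts = defaultdict(int)
            pair_counts = defaultdict(int)
            for basket in sample:
                for item in basket:
                    item_counts[item] += 1
                for pair in combinations(sorted(basket), 2):
                    pair_counts[pair] += 1
            freq_items = {i for i, c in item_counts.items() if c >= threshold}
            freq_pairs = {p for p, c in pair_counts.items() if c >= threshold}
            # Negative border (pairs only): both singletons frequent in the
            # sample, but the pair itself is not.
            border = {p for p in combinations(sorted(freq_items), 2)
                      if p not in freq_pairs}
            # Full pass: count only the sample-frequent pairs and the border.
            candidates = freq_pairs | border
            full = defaultdict(int)
            for basket in baskets:
                for pair in combinations(sorted(basket), 2):
                    if pair in candidates:
                        full[pair] += 1
            if all(full[p] < support for p in border):
                return {p for p in freq_pairs if full[p] >= support}
            # Some border pair was frequent in the whole: resample and retry.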