Database Reference
In-Depth Information
(5) There are 2 M pairs that occur exactly once. Of these pairs, M consist of two frequent
items; the other M each have at least one nonfrequent item.
(6) No other pairs occur at all.
(7) Integers are always represented by 4 bytes.
Suppose we run the A-Priori Algorithm and can choose on the second pass between the
triangular-matrix method for counting candidate pairs and a hash table of item-item-count
triples. Neglect in the first case the space needed to translate between original item num-
bers and numbers for the frequent items, and in the second case neglect the space needed
for the hash table. As a function of N and M , what is the minimum number of bytes of main
memory needed to execute the A-Priori Algorithm on this data?
6.3 Handling Larger Datasets in Main Memory
The A-Priori Algorithm is fine as long as the step with the greatest requirement for main
memory - typically the counting of the candidate pairs C 2 - has enough memory that it
can be accomplished without thrashing (repeated moving of data between disk and main
memory). Several algorithms have been proposed to cut down on the size of candidate set
C 2 . Here, we consider the PCY Algorithm, which takes advantage of the fact that in the
first pass of A-Priori there is typically lots of main memory not needed for the counting of
single items. Then we look at the Multistage Algorithm, which uses the PCY trick and also
inserts extra passes to further reduce the size of C 2 .
6.3.1
The Algorithm of Park, Chen, and Yu
This algorithm, which we call PCY after its authors, exploits the observation that there may
be much unused space in main memory on the first pass. If there are a million items and
gigabytes of main memory, we do not need more than 10% of the main memory for the two
tables suggested in Fig. 6.3 - a translation table from item names to small integers and an
array to count those integers. The PCY Algorithm uses that space for an array of integers
that generalizes the idea of a Bloom filter (see Section 4.3 ). The idea is shown schematic-
ally in Fig. 6.5 .
Search WWH ::




Custom Search