Databases Reference
In-Depth Information
use a one-dimensional triangular array. We store in a[k] the count for the pair
{i, j}, with 1≤i < j≤n, where
n− i
2
k = (i−1)
+ j−i
The result of this layout is that the pairs are stored in lexicographic order, that
is first{1, 2},{1, 3}, . . . ,{1, n}, then{2, 3},{2, 4}, . . . ,{2, n}, and so on, down
to{n−2, n−1},{n−2, n}, and finally{n−1, n}.
The Triples Method
There is another approach to storing counts that may be more appropriate,
depending on the fraction of the possible pairs of items that actually appear in
some basket. We can store counts as triples [i, j, c], meaning that the count of
pair{i, j}, with i < j, is c. A data structure, such as a hash table with i and
j as the search key, is used so we can tell if there is a triple for a given i and j
and, if so, to find it quickly. We call this approach the triples method of storing
counts.
Unlike the triangular matrix, the triples method does not require us to store
anything if the count for a pair is 0. On the other hand, the triples method
requires us to store three integers, rather than one, for every pair that does
appear in some basket. In addition, there is the space needed for the hash
table or other data structure used to support e cient retrieval. The conclusion
is that the triangular matrix will be better if at least 1/3 of the
n
2
possible
pairs actually appear in some basket, while if significantly fewer than 1/3 of the
possible pairs occur, we should consider using the triples method.
Example 6.6 : Suppose there are 100,000 items, and 10,000,000 baskets of 10
items each.
100000
2
= 5×10 9
Then the triangular-matrix method requires
(approximately) integer counts. 1
On the other hand, the total number of pairs
10
2
among all the baskets is 10 7
= 4.5×10 8 . Even in the extreme case that
every pair of items appeared only once, there could be only 4.5×10 8 pairs with
nonzero counts. If we used the triples method to store counts, we would need
only three times that number of integers, or 1.35×10 9 integers. Thus, in this
case the triples method will surely take much less space than the triangular
matrix.
However, even if there were ten or a hundred times as many baskets, it
would be normal for there to be a su ciently uneven distribution of items that
we might still be better off using the triples method. That is, some pairs would
have very high counts, and the number of different pairs that occurred in one
or more baskets would be much less than the theoretical maximum number of
such pairs.
2
` n
2
´
1 Here, and throughout the chapter, we shall use the approximation that
= n 2 /2 for
large n.
Search WWH ::




Custom Search