Database Reference
In-Depth Information
One feature of RLE-compressed vertical data storage architectures that al-
lows for direct operation on compressed data is that not only can they be
exploited in conjunctive query searches, but they can be used profitably also
in join and set operations on relations, as well as in duplicate removal oper-
ations. An important design feature of Cantor's search and sort subsystem, 5
which contains algorithms for internal and external sorting, duplicate tuple
detection, conjunctive query search, key lookup, merge-join, set union, set
difference, and set intersection, is that all these algorithms are designed to
work one (or a small, fixed number of) attribute(s) at a time, to match
the transposed file principle. In most of these algorithms, scanning a com-
pressed transposed file is done by using the sequential read interface which
retains run-compressed data. Internal sorting is carried out using a modi-
fied Quicksort algorithm, which for each (group of) attribute(s) produces a
stream of TIDs as output, according to which subsequent attributes are to be
permuted.
7.2.7 Compression-Aware Optimizations for the Equi-Join
Operator
In Abadi et al., 18 a discussion of a nested loop-join algorithm capable of
operating directly on compressed data is presented, and the paper also con-
tains pseudo-code showing how the join operator may take into account the
compression state of the input columns. Combinations of the three cases un-
compressed , RLE , and bit-vector encoded data are considered. For example, if
one of the input columns is bit-vector encoded and the other is uncompressed,
then the resulting column of positions for the uncompressed column can be
represented using RLE coding, and the resulting column of positions for the
bit-vector column can be copied from the appropriate bit-vector for the value
that matched the predicate. The authors also present results from several
benchmark tests, showing clear, often order-of-magnitude, performance im-
provements from judicious use of data compression, in particular the use of
RLE on ordered data.
In Svensson, 27 the main equi-join algorithm used in Cantor, based on merg-
ing CFTOF-represented relations, was presented by way of a simple example.
The paper states that “the search [scanning] phase of the equi-join operation
...
is so similar to conjunctive query search that it is to be expected that anal-
ogous results hold,” but no performance measurement data are given. The end
result of the scanning phase is a compressed, ordered list of qualifying TID
pairs, that is, a compressed join index. When a join search is to be done on
nonkey attributes, one or both factors have to be re-sorted first. In such cases,
sorting will dominate in the equi-join process, unless the resulting Cartesian
product has much greater cardinality than its factors. However, in situations
where no re-sorting of operands is necessary, this algorithm provides a fast way
of allowing the equi-join operator to work directly on run-length compressed
data.
Search WWH ::




Custom Search