Database Reference
In-Depth Information
specifically index scan, hash join, group by with aggregation, and sort-merge
join , directly on compressed data.
In the MonetDB system, 17 two space optimizations have been applied
that reduce the per-tuple memory in BATs (binary association tables, see
Section 7.2.1):
Virtual TIDs. Generally, when decomposing a relational table, MonetDB
avoids allocating the 4-byte field for the TID since it can be inferred
from the data sequence itself. MonetDB's (and DSM's) approach leaves
the possibility open for nonvirtual TIDs as well, a feature that may be
useful, for example, when performing a hash lookup.
Byte encodings. Database columns often have low domain cardinality. For
such columns, MonetDB uses fixed-size encodings in 1- or 2-byte integer
values.
In Karasalo and Svensson, 5 the approach to data compression used in Can-
tor and its testbed is described. It presupposes the existence of an ecient
way to organize attribute subfiles containing varying length data. The so-
called b-list structure developed for this purpose is a B-tree variant suitable
for storing linear lists that exploit an observation made in Knuth 41
and also
shows similarity to a structure discussed in Maruyama. 42
When data are read from or written into a b-list node, data values are
accessed one block at a time. When a block is written, a compression algo-
rithm is first applied to its data, working as follows: First, from all values
in a given sequence, its minimum value is subtracted and stored in a se-
quence header (cf. the FOR technique discussed above). Then, to store an
arbitrary-length subsequence of integers compactly, four alternatives are con-
sidered:
1. Use the smallest possible common byte length for the subsequence and
store the byte length in the header (bit packing)
2. Use the smallest possible common byte length for the difference subse-
quence and store the first element of the original subsequence as well as
the byte length in the header (FOR-delta)
3. If the sequence is a run of equal values, store the value and length of
the subsequence in the subsequence header (RLE)
4. If the subsequence is a run of equal differences, store the first element
of the subsequence, the common difference, and the subsequence length
in the header (delta-RLE)
To combine these alternatives optimally to store a given data sequence (of
length n) as compactly as possible, a dynamic programming, branch-and-
bound algorithm was developed. This algorithm subdivides the sequence into
subsequences, each characterized by storage alternative , byte size , cardinality ,
and size of header , so as to represent the entire sequence using as few bits
Search WWH ::




Custom Search