Emerging Database Systems in Support of Scientific Data - Scientific Data Management

Database Reference

In-Depth Information

specifically index scan, hash join, group by with aggregation, and sort-merge

join , directly on compressed data.

In the MonetDB system, 17 two space optimizations have been applied

that reduce the per-tuple memory in BATs (binary association tables, see

Section 7.2.1):

Virtual TIDs. Generally, when decomposing a relational table, MonetDB

avoids allocating the 4-byte field for the TID since it can be inferred

from the data sequence itself. MonetDB's (and DSM's) approach leaves

the possibility open for nonvirtual TIDs as well, a feature that may be

useful, for example, when performing a hash lookup.

Byte encodings. Database columns often have low domain cardinality. For

such columns, MonetDB uses fixed-size encodings in 1- or 2-byte integer

values.

In Karasalo and Svensson, 5 the approach to data compression used in Can-

tor and its testbed is described. It presupposes the existence of an ecient

way to organize attribute subfiles containing varying length data. The so-

called b-list structure developed for this purpose is a B-tree variant suitable

for storing linear lists that exploit an observation made in Knuth 41

and also

shows similarity to a structure discussed in Maruyama. 42

When data are read from or written into a b-list node, data values are

accessed one block at a time. When a block is written, a compression algo-

rithm is first applied to its data, working as follows: First, from all values

in a given sequence, its minimum value is subtracted and stored in a se-

quence header (cf. the FOR technique discussed above). Then, to store an

arbitrary-length subsequence of integers compactly, four alternatives are con-

sidered:

1. Use the smallest possible common byte length for the subsequence and

store the byte length in the header (bit packing)

2. Use the smallest possible common byte length for the difference subse-

quence and store the first element of the original subsequence as well as

the byte length in the header (FOR-delta)

3. If the sequence is a run of equal values, store the value and length of

the subsequence in the subsequence header (RLE)

4. If the subsequence is a run of equal differences, store the first element

of the subsequence, the common difference, and the subsequence length

in the header (delta-RLE)

To combine these alternatives optimally to store a given data sequence (of

length n) as compactly as possible, a dynamic programming, branch-and-

bound algorithm was developed. This algorithm subdivides the sequence into

subsequences, each characterized by storage alternative , byte size , cardinality ,

and size of header , so as to represent the entire sequence using as few bits

Search WWH ::

Custom Search

Home