Emerging Database Systems in Support of Scientific Data - Scientific Data Management

Database Reference

In-Depth Information

projections (expression calculation) do this easily, but Boncz et al. 31 try

to achieve the same for other query processing operators as well (e.g.,

aggregation). This allows compilers to produce ecient loop-pipelined

code.

7.2.4 Data Compression

There are two obvious ways a DBMS can trade CPU cycles to save disk

space and thereby I/O bandwidth, a precious resource. 17 In this context, we

are concerned with I/O bandwidth only, since disk space itself has recently

become so cheap that its cost rarely matters at all except perhaps in extreme

applications, such as large-scale Web search. 34 First, data may be coded into

a more compact form. 8 For example, if one is storing an attribute that is

a U.S. customer's state of residence, the state can be coded directly into 6

bits, whereas the standard two-character abbreviation requires 16 bits and a

variable-length character string for the full name of the state requires many

more. Second, data values may be packed compactly by storing N values in

K*N bits, where K is the smallest byte size that can hold any value in the

column ( bit packing ). Of course, more sophisticated schemes may be used,

which can save even more space while allowing for flexible updates with data

items which require more than K bits. It is also possible to use additional

techniques, in particular sort order, to save additional I/O bandwidth. Note

that these simple data compression techniques are equally applicable to row-

wise as to column-wise storage schemes. Therefore, when a query is processed

using a column-wise storage scheme, what makes the basic difference with

respect to I/O bandwidth is the fact that there is no need to transfer data

from irrelevant columns into main memory. As we will see, however, additional,

more sophisticated compression techniques may also be exploited for column-

wise storage.

Although Abadi et al. 18 state that “it was not until the 90s when researchers

began to concentrate on how compression affects database performance,” the

two early papers 16 , 35 both emphasize in different ways that fast data access

may be achieved through judicious use of data compression in fully trans-

posed ( a

vertically partitioned) files. Early tutorials on the subject in-

clude Severance, 36 Bassiouni, 37 and Roth and Van Horn; 38 however, none of

them specifically discusses how to achieve query evaluation-speed improve-

ments. Recent papers on the use of data compression in relational databases

are Westmann et al., 39 Abadi et al., 18 and Raman and Swart. 40 The most sig-

nificant advantages are typically obtained when combining data compression

with fully transposed file or DSM storage, but there are also proposals to use

compression in row-oriented storage schemes.

Abadi et al. 18 discuss an extension of the column-store storage and access

subsystem of the C-Store system, 8 while also addressing the issue of querying

compressed data. The extended column storage exploits the fact that “sorted

data is usually quite compressible” 18 (p. 671), and suggests storing columns in

.

k

.

a

.

Scientific Data Management

Search WWH ::

Custom Search

Home