Accelerating Queries on Very Large Datasets - Scientific Data Management


6.7.1 Data Processing Systems

To access data efficiently, the underlying data must be organized in a suitable manner, since the speed of query processing depends on the data organization. In most cases, the data organization of a data processing system is inextricably linked to the system design, so we cannot easily separate the data organization from the systems that support it. Next, we review a few example systems to see how their data organization affects query processing speed. Since most of the preceding discussion applies to traditional DBMSs, we will not discuss them any further.

6.7.1.1 Column-Based Systems

Column-based systems are discussed extensively in Chapter 7. Here, we only mention some names and briefly argue for their effectiveness.

A number of commercial database systems organize their data in a column-oriented fashion, for example, Sybase IQ, Vertica, and Kx Systems [98]. Among them, Kx Systems can be regarded as an array database because it treats an array as a first-class citizen, like an integer number. A number of research systems use vertical data organization as well, for example, C-Store [90, 91], MonetDB [16, 17], and FastBit. One common feature of all these systems is that they logically organize the values of a column together. This offers a number of advantages. For example, a typical query involves only a small number of columns; the column-oriented data organization allows the system to access only the columns involved, which minimizes the I/O time. In addition, since the values in a column are of the same type, it is easier to determine the location of each value and avoid accessing irrelevant rows. The values within a column are also more likely to be the same as one another than the values of different columns that sit next to each other in a row-oriented data organization, which makes compression more effective [1].
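The two advantages above, reading only the columns a query involves and compressing similar values stored together, can be illustrated with a minimal Python sketch. The toy table, the column names, and every function name below are hypothetical and chosen only for illustration; they are not taken from any of the systems mentioned. Run-length encoding stands in for the compression methods such systems may use.

```python
from itertools import groupby

# Hypothetical toy table (station, year, temperature); all values are made up.
COLUMNS = ("station", "year", "temperature")
rows = [
    ("A", 2001, 10), ("A", 2001, 11), ("A", 2001, 12),
    ("A", 2002, 13), ("B", 2002, 14), ("B", 2002, 15),
]


def to_columns(rows, names):
    """Vertical layout: store the values of each column together in one list."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def count_matches_row_store(rows, col_index, predicate):
    """Row layout: whole rows are read, including fields the query ignores."""
    hits = 0
    values_read = 0
    for row in rows:
        values_read += len(row)
        if predicate(row[col_index]):
            hits += 1
    return hits, values_read


def count_matches_column_store(columns, name, predicate):
    """Column layout: only the single column named in the query is read."""
    column = columns[name]
    hits = sum(1 for value in column if predicate(value))
    return hits, len(column)


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def is_2002(year):
    return year == 2002


columns = to_columns(rows, COLUMNS)

# Same query, two layouts: count the rows whose year is 2002.
row_hits, row_read = count_matches_row_store(rows, 1, is_2002)
col_hits, col_read = count_matches_column_store(columns, "year", is_2002)

# Compression: one column on its own versus the row-interleaved value stream.
year_runs = run_length_encode(columns["year"])
interleaved = [value for row in rows for value in row]
interleaved_runs = run_length_encode(interleaved)

print(row_hits, row_read)     # hits and values touched in the row layout
print(col_hits, col_read)     # hits and values touched in the column layout
print(len(year_runs), len(interleaved_runs))
```

In this toy case both layouts return the same answer, but the row layout touches every field of every row while the column layout touches only the year values. Likewise, the year column collapses into a couple of runs, whereas the interleaved row stream has no adjacent repeats for run-length encoding to exploit. Real systems work on disk pages and use more elaborate encodings, so the sketch shows the direction of the effect, not its size.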

6.7.1.2 Special-Purpose Data Analysis Systems

Most scientific data formats, such as FITS, NetCDF, and HDF5, come with their own data access and analysis libraries and can be considered special-purpose data analysis systems. By far the most developed of such systems is ROOT [18, 19, 70]. ROOT is a data management system developed by physicists, originally for high-energy physics data. It currently manages many petabytes of data around the world, more than many of the well-known commercial DBMS products. ROOT uses an object-oriented metaphor for its data: a unit of data is called an object or an event (of a high-energy collision), which corresponds to a row in a relational table. The records are grouped into files, and the primary access method to records in a file is to iterate through them with an iterator. Once an event is available to the user, all of its attributes are available. This is essentially row-oriented data access. In recent versions of ROOT, it is possible to split some attributes of an event to store
