Accelerating Queries on Very Large Datasets - Scientific Data Management


column-oriented organization is also known as the vertical data organization.
There are many variations based on these two basic organizations. For example,
a large table is often horizontally split into partitions, where each partition
is then further organized horizontally or vertically. Since the organization of
a partition typically has more impact on query processing, our discussion will
center around how the partitions are organized. The data organization of a
system is typically fixed; therefore, to discuss data organization we cannot
avoid touching on different systems even though they have been discussed
elsewhere already. Most notably, Chapter 7 has extensive information about
systems with vertical data organizations.
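To make the partitioning idea concrete, the following is a minimal sketch, not the layout of any particular system. It splits a small table horizontally into partitions and then stores each partition vertically, one list per column. The record layout (id, temperature, pressure), the helper names partition_rows and to_vertical, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (id, temperature, pressure).
Row = Tuple[int, float, float]


def partition_rows(rows: List[Row], partition_size: int) -> List[List[Row]]:
    """Horizontal split: consecutive rows are grouped into partitions."""
    return [rows[i:i + partition_size]
            for i in range(0, len(rows), partition_size)]


def to_vertical(partition: List[Row], names: List[str]) -> Dict[str, list]:
    """Vertical organization of one partition: one list per column."""
    return {name: [row[k] for row in partition]
            for k, name in enumerate(names)}


names = ["id", "temperature", "pressure"]
rows = [(0, 280.5, 1.01), (1, 301.2, 0.98), (2, 295.0, 1.00), (3, 310.7, 0.95)]

partitions = partition_rows(rows, partition_size=2)
vertical = [to_vertical(p, names) for p in partitions]

# A query on temperature touches only the "id" and "temperature" columns
# of each partition; the "pressure" column is never read.
hits = [row_id
        for part in vertical
        for row_id, t in zip(part["id"], part["temperature"])
        if t > 300.0]
print(hits)  # [1, 3]
```

In this sketch the partition is the unit whose internal organization changes, which is the aspect the discussion above says matters most for query processing.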

This chapter primarily focuses on access methods, and mostly on indexing
techniques that speed up data accesses in query processing. Because these
methods can be implemented in software and have great potential to improve
query performance, there have been extensive research activities on this
subject. To motivate our discussion, we review key characteristics of scientific
data and queries in the next section. In Section 6.3, we present a taxonomy of
index methods. In the following two sections, we review some well-known index
methods, with Section 6.4 on single-column indexing and Section 6.5 on
multidimensional indexing. Because scientific data are often high-dimensional,
we also present a type of index that has been demonstrated to work well with
such data: the bitmap index. We devote Section 6.6 to recent advances on the
bitmap index. In Section 6.7 we revisit the data organization issue by
examining a number of emerging data processing systems with unusual data
organizations. None of these systems yet uses any indexing method. We present a
small test to demonstrate that even such systems could benefit from an
efficient indexing method.
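As a preview of the bitmap index named above, here is a minimal sketch of the basic idea only: an uncompressed, equality-encoded bitmap index with one bitmap per distinct column value. It is not one of the advanced variants covered in Section 6.6. Python integers stand in for the bitmaps, and the column names (phase, quality), their values, and the helper names build_bitmap_index and rows_of are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def build_bitmap_index(column: Sequence[Hashable]) -> Dict[Hashable, int]:
    """One bitmap per distinct value; bit i is set if row i holds that value."""
    index: Dict[Hashable, int] = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)


def rows_of(bitmap: int) -> List[int]:
    """Row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical columns of the same six-row table.
phase = ["gas", "liquid", "gas", "solid", "liquid", "gas"]
quality = ["good", "good", "bad", "good", "bad", "good"]

phase_index = build_bitmap_index(phase)
quality_index = build_bitmap_index(quality)

# phase == "gas" AND quality == "good": a single bitwise AND of two bitmaps.
answer = phase_index["gas"] & quality_index["good"]
print(rows_of(answer))  # [0, 5]
```

Conditions on several columns are combined with bitwise operations on per-column bitmaps, so each column is indexed independently of the others.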

6.2 Characteristics of Scientific Data

Scientific databases are massive datasets accumulated through scientific
experiments, observations, and computations. New and improved instruments
now not only provide better data precision but also capture data at a much
faster rate, resulting in large volumes of data. Ever-increasing computing
power is leading to ever-larger and more realistic computational simulations,
which also produce large volumes of data. Analysis of these massive datasets
by domain scientists often involves finding specific data items that have
characteristics of particular interest. Unlike traditional information
management systems (IMS), such as those managing bank records in the 1970s
and 1980s, where the database consisted of a few megabytes of records with a
small number of attributes, scientific databases typically consist of
terabytes of data (or billions of records) with hundreds of attributes.
Scientific databases are generally organized as datasets. Often these datasets are
