Accelerating Queries on Very Large Datasets - Scientific Data Management


Index schemes that efficiently process queries on scientific datasets are
effective only if they are built within the framework of the underlying
physical data organization understood by the computational processing model.
One example of a highly successful index method is the bitmap index method
[4, 21, 48, 102, 103], which is elaborated upon in some detail in Section 6.6.
To understand why traditional DBMSs and their accompanying index methods, such
as B-trees, hashing, and R-trees, have been less effective in managing
scientific datasets, we examine some of the characteristics of these
applications.

Data Organizational Framework: Many existing scientific datasets are stored in
custom-formatted files and may come with their own analysis systems. ROOT is a
very successful example of such a system [18, 70]. Much astrophysics data is
stored in FITS format [41], and many other scientific datasets are stored in
NetCDF format [61] and HDF format [42]. Most of these formats, including FITS,
NetCDF, and HDF, are designed to store arrays, which can be thought of as a
vertical data organization. ROOT, however, organizes data as objects and is
essentially row-oriented.
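
The difference between a row-oriented and a vertical (array) organization can be illustrated with a minimal Python sketch. The records, the attribute names (`id`, `energy`, `charge`), and the helper functions are hypothetical and do not reflect the actual ROOT, FITS, NetCDF, or HDF APIs; the sketch only shows that a query on one attribute scans a single array in the vertical layout but visits every whole record in the row layout.

```python
# Hypothetical records: each one is an object holding all of its attributes
# together, which is the row-oriented view.
rows = [
    {"id": 0, "energy": 12.5, "charge": -1},
    {"id": 1, "energy": 47.0, "charge": 1},
    {"id": 2, "energy": 8.25, "charge": 1},
    {"id": 3, "energy": 30.0, "charge": -1},
]


def to_columns(records):
    """Convert row-oriented records into one array (list) per attribute."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def select_rows(records, threshold):
    """Row layout: every whole record is visited to test one attribute."""
    return [r["id"] for r in records if r["energy"] > threshold]


def select_columns(cols, threshold):
    """Vertical layout: only the 'energy' array is scanned; matching
    positions are then used to look up the corresponding ids."""
    return [cols["id"][i] for i, e in enumerate(cols["energy"]) if e > threshold]


columns = to_columns(rows)
print(select_rows(rows, 20.0))
print(select_columns(columns, 20.0))
```

Both functions return the same answer; what differs is how much data has to be touched to produce it, which is the point an index scheme has to take into account.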

High-Performance Computing (HPC): Data analysis and computational science
applications, for example, climate modeling, have application codes that run
in high-performance computing environments involving hundreds or thousands of
processors. Often these parallel application codes utilize a library of data
structures for hierarchical structured grids in which the grid points are
associated with a list of attribute values. Examples of such applications
include finite element, finite difference, and adaptive mesh refinement (AMR)
methods. To output the data from the application programs efficiently, the
data records are often organized in the same way as they are computed. The
analysis programs then have to reorganize them into a coherent, logical view,
which adds some unique challenges for data access methods.
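
The reorganization step described above can be sketched as follows. This is only an illustration under simple assumptions: a small 2-D grid is split into rectangular blocks, one per hypothetical processor, each block is written in the order it was computed, and an analysis program later rebuilds the global grid. The block layout and the `origin`/`shape`/`values` fields are invented for the example and are not taken from any particular grid library.

```python
def decompose(grid, block_rows, block_cols):
    """Split a 2-D grid into blocks and return them in the order that each
    (hypothetical) processor would write its own records."""
    n_rows, n_cols = len(grid), len(grid[0])
    blocks = []
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            height = min(block_rows, n_rows - r0)
            width = min(block_cols, n_cols - c0)
            values = [grid[r][c]
                      for r in range(r0, r0 + height)
                      for c in range(c0, c0 + width)]
            blocks.append({"origin": (r0, c0),
                           "shape": (height, width),
                           "values": values})
    return blocks


def reassemble(blocks, n_rows, n_cols):
    """Rebuild the global, logically ordered grid from per-processor blocks."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for block in blocks:
        r0, c0 = block["origin"]
        _, width = block["shape"]
        for k, value in enumerate(block["values"]):
            grid[r0 + k // width][c0 + k % width] = value
    return grid


# A 4x4 grid whose value encodes its own (row, column) position.
grid = [[10 * r + c for c in range(4)] for r in range(4)]
blocks = decompose(grid, 2, 2)
print(blocks[1]["values"])                 # records as written by one processor
print(reassemble(blocks, 4, 4) == grid)    # logical view recovered
```

In the output order, neighboring grid points can end up far apart in the file, so an access method has to carry the block metadata (here `origin` and `shape`) to map a logical query back to stored records.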

Data-Intensive I/O: Highly parallel computations in HPC often also perform
data-intensive input and output. A natural approach to meeting the I/O
throughput requirements in HPC is the use of parallel I/O and parallel file
systems. To meet the I/O bandwidth requirements, the parallel counterparts of
data formats such as NetCDF and HDF/HDF5 are applied to provide consistent
partitioning of the dataset into chunks that are then striped over the disks
of a parallel file system. While such partitioning is efficient during the
computations that produce the data, the same partitioning is usually
inefficient for later data analysis. Reorganization of the data or an index
structure is required to improve the efficiency of the data analysis
operations.
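
A small sketch may help show why a partitioning that suits the producer can be a poor fit for analysis. It assumes a hypothetical 8 x 8 dataset in which each row is one time step written as a single chunk, and chunks are placed round-robin over four disks. The chunk shape, the round-robin placement, and the function names are assumptions made for illustration; real parallel file systems and the parallel NetCDF/HDF5 libraries have their own chunking and striping policies.

```python
def chunks_touched(row_range, col_range, chunk_rows, chunk_cols, n_cols):
    """Return the ids of all chunks that overlap a rectangular query.

    row_range and col_range are half-open (start, stop) pairs, and chunks
    are numbered row-major across the chunk grid.
    """
    chunks_per_row = -(-n_cols // chunk_cols)  # ceiling division
    first_cr = row_range[0] // chunk_rows
    last_cr = (row_range[1] - 1) // chunk_rows
    first_cc = col_range[0] // chunk_cols
    last_cc = (col_range[1] - 1) // chunk_cols
    return {
        cr * chunks_per_row + cc
        for cr in range(first_cr, last_cr + 1)
        for cc in range(first_cc, last_cc + 1)
    }


def disk_of(chunk_id, n_disks):
    """Round-robin striping (an assumption): chunk i goes to disk i mod n."""
    return chunk_id % n_disks


N_ROWS, N_COLS = 8, 8            # rows are time steps, columns are grid points
CHUNK_ROWS, CHUNK_COLS = 1, 8    # one chunk per time step
N_DISKS = 4

# Producer access pattern: write one whole time step.
write_chunks = chunks_touched((3, 4), (0, N_COLS), CHUNK_ROWS, CHUNK_COLS, N_COLS)

# Analysis access pattern: one grid point across all time steps.
analysis_chunks = chunks_touched((0, N_ROWS), (2, 3), CHUNK_ROWS, CHUNK_COLS, N_COLS)

values_read = len(analysis_chunks) * CHUNK_ROWS * CHUNK_COLS
values_needed = N_ROWS

print(sorted(write_chunks), sorted(analysis_chunks))
print(values_read, values_needed)
```

Under these assumptions the write touches a single chunk, while the analysis query touches every chunk and reads far more values than it needs, which is the kind of mismatch that data reorganization or an index structure is meant to address.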

Non-Transactional ACID Properties: Most scientific applications do not access
data for analysis while concurrently updating the same data records. The new
data records are usually added to the data in large
