Database Reference
Index schemes that efficiently process queries on scientific datasets are only
effective if they are built within the framework of the underlying physical data
organization understood by the computational processing model. One example
of a highly successful index method is the bitmap index [4, 21, 48, 102, 103],
which is elaborated upon in some detail in Section 6.6. To understand why
traditional DBMSs and their accompanying index methods, such as B-trees,
hashing, and R-trees, have been less effective in managing scientific
datasets, we examine some of the characteristics of these applications.
Data Organizational Framework: Many existing scientific datasets
are stored in custom-formatted files and may come with their own
analysis systems. ROOT is a very successful example of such a system [18, 70].
Much astrophysics data is stored in FITS format [41], and many other
scientific datasets are stored in NetCDF [61] and HDF [42] formats.
Most of these formats, including FITS, NetCDF, and HDF, are designed
to store arrays, which can be thought of as a vertical data
organization. ROOT, however, organizes data as objects and is essentially
row-oriented.
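
The difference between a row-oriented (object) layout and a vertical (array) layout can be illustrated with a minimal Python sketch. The records, attribute names, and the helper `to_columns` are hypothetical and are not taken from ROOT, FITS, NetCDF, or HDF; the sketch only shows that a single-attribute query reads one array in the vertical layout but must visit every full record in the row-oriented one.

```python
# Hypothetical records (row-oriented): each object carries all its attributes.
records = [
    {"energy": 1.5, "charge": -1, "mass": 0.511},
    {"energy": 7.2, "charge": 1, "mass": 938.3},
    {"energy": 3.1, "charge": 0, "mass": 939.6},
]


def to_columns(rows):
    """Convert a list of records into one array (list) per attribute."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(records)

# Row-oriented scan: every record is visited in full to test one attribute.
row_hits = [i for i, r in enumerate(records) if r["energy"] > 3.0]

# Vertical scan: only the "energy" array is read.
col_hits = [i for i, e in enumerate(columns["energy"]) if e > 3.0]
```

Both scans select the same records; what differs is how much data each one has to read to answer the query.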
High-Performance Computing (HPC): Data analysis and computational
science applications, such as climate modeling, have application
codes that run on high-performance computing environments
with hundreds or thousands of processors. Often these parallel
application codes use a library of data structures for hierarchical
structured grids in which each grid point is associated with a list of
attribute values. Examples of such applications include finite element,
finite difference, and adaptive mesh refinement (AMR) methods. To
efficiently output the data from the application programs, the data records
are often organized in the same way as they are computed. The analysis
programs have to reorganize them into a coherent, logical view, which
adds some unique challenges for data access methods.
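
The reorganization step described above can be sketched as follows. This is an illustrative assumption about how such output might look, not the format of any particular HPC library: four hypothetical processes each own a 2x2 tile of a 4x4 grid and write their tiles in completion order, tagged with the tile's origin. The hypothetical helper `assemble` rebuilds the logical row-major view.

```python
def assemble(blocks, n_rows, n_cols):
    """Place each block at its origin to rebuild the global row-major grid."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for (r0, c0), block in blocks:
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                grid[r0 + i][c0 + j] = value
    return grid


# Hypothetical output: (origin, tile) pairs stored in the order the processes
# finished computing, not in grid order.
blocks = [
    ((2, 2), [[10, 11], [14, 15]]),
    ((0, 0), [[0, 1], [4, 5]]),
    ((2, 0), [[8, 9], [12, 13]]),
    ((0, 2), [[2, 3], [6, 7]]),
]

grid = assemble(blocks, 4, 4)
```

An analysis program that wants one logical row of the grid has to gather pieces from two separately stored tiles, which is the kind of access-pattern mismatch an index or a reorganized copy is meant to hide.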
Data-Intensive I/O: Often, highly parallel computations in HPC also
perform data-intensive input and output. A natural approach to
meeting the I/O throughput requirements in HPC is the use of parallel
I/O and parallel file systems. To meet the I/O bandwidth requirements
in HPC, the parallel counterparts of data formats such as NetCDF and
HDF/HDF5 are applied to provide consistent partitioning of the dataset
into chunks that are then striped over the disks of a parallel file system.
While such partitioning is efficient during the computations that produce
the data, the same partitioning is usually inefficient for later data
analysis. Reorganization of the data or an index structure is required to
improve the efficiency of the data analysis operations.
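
A small sketch can make the mismatch between write-time and analysis-time partitioning concrete. Everything here is assumed for illustration: the dataset shape (8 time steps by 1024 grid points), the chunk shapes, the round-robin striping rule, and the helpers `chunks_touched` and `disk_of` are hypothetical and do not reflect the actual behavior of parallel NetCDF, HDF5, or any specific file system.

```python
def chunks_touched(row_range, col_range, chunk_shape):
    """Return the chunk coordinates that a rectangular read or write overlaps.

    Ranges are half-open: (start, stop).
    """
    (r0, r1), (c0, c1) = row_range, col_range
    ch_r, ch_c = chunk_shape
    return {
        (cr, cc)
        for cr in range(r0 // ch_r, (r1 - 1) // ch_r + 1)
        for cc in range(c0 // ch_c, (c1 - 1) // ch_c + 1)
    }


def disk_of(chunk, chunks_per_row, n_disks):
    """Round-robin striping: linearize the chunk coordinates, then take modulo."""
    cr, cc = chunk
    return (cr * chunks_per_row + cc) % n_disks


# One chunk per time step: the layout that suits the producing computation.
by_time_step = (1, 1024)
write_one_step = chunks_touched((3, 4), (0, 1024), by_time_step)
read_time_series = chunks_touched((0, 8), (17, 18), by_time_step)

# Alternative layout: each chunk spans all 8 time steps for 128 grid points.
by_region = (8, 128)
read_time_series_reorg = chunks_touched((0, 8), (17, 18), by_region)

# Disks visited by the time-series read under the original layout (4 disks).
disks = sorted({disk_of(c, 1, 4) for c in read_time_series})
```

Under the first layout, writing one time step touches a single chunk, while reading the time series of one grid point touches all eight chunks on every disk. Under the reorganized layout the same read touches one chunk, at the cost of making each time-step write span several chunks.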
Non-Transactional ACID Properties: Most scientific applications do
not access data for analysis while concurrently updating the same data
records. The new data records are usually added to the data in large