Accelerating Queries on Very Large Datasets - Scientific Data Management

chunks. This allows the data management system to treat access control in a much more optimistic manner than is possible with traditional DBMSs. This feature will be particularly important as data management systems evolve to take advantage of multicore architectures and clusters of such multicore computers, where concurrent access to data is a necessity.

More discussion about the differences between scientific and commercial DBMSs is presented in Section 7.6 in the context of SciDB.

6.3 A Taxonomy of Index Methods

An access method defines a data organization, the data structures, and the algorithms for accessing individual data items that satisfy some query criteria. For example, given N records, each with k attributes, one very simple access method is the sequential scan. The records are stored in N consecutive locations, and for any query the entire set of records is examined one after the other. For each record, the query condition is evaluated; if the condition is satisfied, the record is reported as a hit of the query. The data organization for such a sequential scan is called the heap. A general strategy to accelerate this process is to augment the heap with an index scheme.
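
The sequential scan described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name sequential_scan, the in-memory list standing in for the heap, and the sample records with their attribute meanings are all assumptions made for the example, not part of any particular system.

```python
from typing import Callable, Iterator, Sequence, Tuple

Record = Tuple[float, ...]  # one record holding k attribute values


def sequential_scan(
    heap: Sequence[Record],
    predicate: Callable[[Record], bool],
) -> Iterator[Record]:
    """Examine every record in storage order and yield the hits."""
    for record in heap:
        # The query condition is evaluated once per record.
        if predicate(record):
            yield record


# Hypothetical heap of N = 4 records with k = 3 attributes
# (assumed here to be: id, temperature, pressure).
heap = [
    (0, 291.5, 1.01),
    (1, 305.2, 0.98),
    (2, 288.0, 1.03),
    (3, 310.7, 0.95),
]

# Query: all records whose second attribute exceeds 300.
hits = list(sequential_scan(heap, lambda r: r[1] > 300.0))
```

Every query touches all N records regardless of how many of them qualify, which is the cost an index scheme is meant to reduce.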

An index scheme consists of a data structure and its associated algorithms that speed up data accesses such as insertions, deletions, retrievals, and query processing. The choice of an index scheme for accessing a dataset depends heavily on a number of factors, including the following:

Dataset Size: One factor is whether the data can be contained entirely in memory or not. Since our focus is on massively large scientific datasets, we will assume the latter, with some consideration for main-memory indexes when necessary.

Data Organization: The datasets may be organized into fixed-size data blocks (also referred to as data chunks or buckets at times). A data block is typically defined as a multiple of the physical page size of disk storage. A data organization may be defined to allow for future insertions and deletions without impacting the speed of accessing data by the index scheme. On the other hand, the data may be organized and constrained to be read-only, append-only, or both. Another influencing data organization factor is whether the records are of fixed length or variable length. Of particular interest in scientific datasets are those datasets that are mapped into very large k-dimensional arrays. To partition the array into manageable units for transferring between memory and disk storage, fixed-size subarrays called chunks are used. Examples of such data organization methods are NetCDF [61], HDF5 [42], and FITS [41].
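
To make the chunking of a k-dimensional array concrete, the following Python sketch maps an element index to the chunk that holds it and to the element's position inside that chunk. It assumes row-major ordering of both the chunks and the elements within a chunk. The function names chunk_grid and locate and the array and chunk shapes are hypothetical, and the sketch does not reproduce the actual on-disk layout of NetCDF, HDF5, or FITS.

```python
from math import ceil


def chunk_grid(array_shape, chunk_shape):
    """Number of chunks along each dimension (edge chunks may be partly empty)."""
    return tuple(ceil(n / c) for n, c in zip(array_shape, chunk_shape))


def locate(index, array_shape, chunk_shape):
    """Map an element index to (chunk number, offset inside the chunk).

    Both numbers are linearized in row-major order (an assumption of this sketch).
    """
    grid = chunk_grid(array_shape, chunk_shape)
    chunk_coord = tuple(i // c for i, c in zip(index, chunk_shape))
    inner_coord = tuple(i % c for i, c in zip(index, chunk_shape))

    # Linearize the chunk coordinate over the grid of chunks.
    chunk_id = 0
    for coord, extent in zip(chunk_coord, grid):
        chunk_id = chunk_id * extent + coord

    # Linearize the element coordinate inside its chunk.
    offset = 0
    for coord, extent in zip(inner_coord, chunk_shape):
        offset = offset * extent + coord

    return chunk_id, offset


# Hypothetical 3-dimensional array split into 100 x 100 x 50 chunks.
array_shape = (1000, 1000, 500)
chunk_shape = (100, 100, 50)

grid = chunk_grid(array_shape, chunk_shape)
chunk_id, offset = locate((250, 731, 49), array_shape, chunk_shape)
```

Because every chunk has the same size, a reader only needs the chunk number to compute where the chunk starts on disk, and only that chunk has to be transferred to memory to reach the requested element.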
