Accelerating Queries on Very Large Datasets - Scientific Data Management


DBMS system. This may give rise to different types of data access methods and different ways of organizing them as well.

Consider a typical database in astrophysics. The archived data include observational parameters such as the detector, the type of observation, coordinates, astronomical object, exposure time, and so forth. Besides the use of data-mining techniques to identify features, users need to perform queries based on physical parameters such as magnitude of brightness, redshift, spectral indexes, morphological type of galaxies, photometric properties, and so forth, to easily discover the object types contained in the archive. The search usually can be expressed as constraints on some of these properties, and the objects satisfying the conditions are retrieved and sent downstream to other processing steps such as statistics gathering and visualization.
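A search expressed as constraints on physical parameters can be sketched as follows. This is only an illustration: the catalog records, the attribute names (`magnitude`, `redshift`, `morph_type`), and the helper `select` are hypothetical and are not taken from any real archive or library. The sketch scans every record, which is exactly the cost that becomes a problem on very large datasets.

```python
# Hypothetical catalog records; attribute names and values are illustrative only.
catalog = [
    {"object_id": 1, "magnitude": 17.2, "redshift": 0.05, "morph_type": "spiral"},
    {"object_id": 2, "magnitude": 21.8, "redshift": 0.90, "morph_type": "elliptical"},
    {"object_id": 3, "magnitude": 19.5, "redshift": 0.31, "morph_type": "spiral"},
    {"object_id": 4, "magnitude": 18.9, "redshift": 0.28, "morph_type": "irregular"},
]


def select(records, ranges):
    """Return the records whose attributes all lie in the given closed ranges.

    `ranges` maps an attribute name to a (low, high) pair. Every record is
    inspected, so the cost grows linearly with the number of records.
    """
    def matches(rec):
        return all(lo <= rec[attr] <= hi for attr, (lo, hi) in ranges.items())

    return [rec for rec in records if matches(rec)]


# Constraints on two properties: magnitude at most 20 and redshift in [0.2, 0.4].
hits = select(catalog, {"magnitude": (float("-inf"), 20.0),
                        "redshift": (0.2, 0.4)})
hit_ids = [rec["object_id"] for rec in hits]
```

The matching records in `hits` are what would be passed downstream to steps such as statistics gathering or visualization.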

The datasets from most scientific domains (with the possible exception of bioinformatics and genome data) can mostly be characterized as time-varying arrays. Each element of the array often corresponds to some attribute of the points or cells in two- or three-dimensional space. Examples of such attributes in a climate model are temperature, pressure, wind velocity, moisture, cloud cover, and so on. Datasets encountered in scientific data management can be characterized along three principal dimensions:

Size: This is the number of data records maintained in the database. Scientific datasets are typically very large and grow over time to terabytes or petabytes. This translates to millions or billions of data records. The data may span hundreds to thousands of disk storage units and are often archived on robotic tapes.

Dimensionality: The number of searchable attributes of the datasets may be quite large. Often, a data record can have a large number of attributes, and scientists may want to conduct searches based on dozens or hundreds of attributes. For example, a record of a high-energy collision in the STAR experiment [87] is about 5 MB in size, and the physicists involved in the experiment have decided to make 200 or so high-level attributes searchable [101].

Time: This concerns the rate at which the data content evolves over time. Often, scientific datasets are constrained to be append-only, as opposed to the frequent random insertions and deletions typically encountered in commercial databases.
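The three dimensions can be made concrete with a small sketch of a time-varying array over a fixed two-dimensional grid. Everything here is an assumption made for illustration: the class name `TimeVaryingArray`, its methods, and the sample climate values are hypothetical, and counting one record per cell per time step is just one possible convention. The only supported update is appending a new time step, mirroring the append-only constraint described above.

```python
class TimeVaryingArray:
    """Append-only sequence of time steps over a fixed 2-D grid (illustrative)."""

    def __init__(self, attributes, rows, cols):
        self.attributes = tuple(attributes)  # searchable attributes per cell
        self.rows, self.cols = rows, cols
        self._steps = []                     # one dict of grids per time step

    def append(self, step):
        """Add a new time step; earlier steps are never modified or removed."""
        if set(step) != set(self.attributes):
            raise ValueError("step must supply every attribute exactly once")
        for name, grid in step.items():
            if len(grid) != self.rows or any(len(r) != self.cols for r in grid):
                raise ValueError(f"grid for {name!r} has the wrong shape")
        # Store immutable copies so callers cannot alter archived data.
        self._steps.append(
            {name: tuple(tuple(row) for row in grid) for name, grid in step.items()}
        )

    @property
    def size(self):
        """Number of records, counting one record per cell per time step."""
        return len(self._steps) * self.rows * self.cols

    @property
    def dimensionality(self):
        """Number of searchable attributes."""
        return len(self.attributes)

    def value(self, t, attribute, row, col):
        """Read one attribute of one cell at time step t."""
        return self._steps[t][attribute][row][col]


series = TimeVaryingArray(["temperature", "pressure"], rows=2, cols=3)
series.append({"temperature": [[280.1, 281.0, 279.5], [282.3, 283.0, 281.7]],
               "pressure": [[1012, 1011, 1013], [1009, 1010, 1008]]})
series.append({"temperature": [[280.4, 281.2, 279.9], [282.0, 283.4, 281.5]],
               "pressure": [[1011, 1010, 1012], [1008, 1009, 1007]]})
```

Because there is no method for deleting or overwriting a time step, structures built over the data only ever need to handle growth at the end, which is the property that distinguishes such datasets from frequently updated commercial databases.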

Traditional DBMSs such as ORACLE, Sybase, and Objectivity have not had much success in scientific data management; they have found only limited applications. For example, a traditional relational DBMS, MySQL, is used to manage the metadata, while the principal datasets are managed by domain-specific DBMSs such as ROOT [18, 70]. Gray et al. [35] have argued that managing the metadata with a nonprocedural data manipulation language combined with data indexing is essential when analyzing scientific datasets.
