Conclusions and Future Outlook - Scientific Data Management

the data is stored and from interacting with the workflow system directly. The

analysis facility should take care of the storage of the data, execution of the

workflow, translation of formats between steps to match each step's input

and output requirements, and generation of products for the user to track

and monitor progress of the analysis process. This suggests the concept of an

“abstract workflow” that is mapped into “concrete (or executable) workflows”

as described in Chapter 13. We expect such facilities to become the norm in

conducting scientific analysis.
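
The mapping from an abstract workflow to a concrete one can be illustrated with a minimal sketch. It is not the mechanism described in Chapter 13. It assumes a single facility, a linear chain of steps, and a facility that inserts a format translation whenever one step's output does not match the next step's input. All names (AbstractStep, ConcreteStep, make_concrete, the step and format names, and "facility-A") are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractStep:
    """What the user asks for: an operation and the formats it reads and writes."""
    name: str
    input_format: str
    output_format: str


@dataclass(frozen=True)
class ConcreteStep:
    """What the facility runs: a step bound to a site, or an inserted translation."""
    name: str
    site: str
    kind: str  # "compute" or "translate"


def make_concrete(abstract_steps, stored_format, site):
    """Map an abstract workflow to a concrete one for a single (hypothetical) site."""
    concrete = []
    current_format = stored_format
    for step in abstract_steps:
        # Insert a translation whenever the data on hand does not match the step's input.
        if step.input_format != current_format:
            translator = f"{current_format}_to_{step.input_format}"
            concrete.append(ConcreteStep(translator, site, "translate"))
        concrete.append(ConcreteStep(step.name, site, "compute"))
        current_format = step.output_format
    return concrete


# Hypothetical three-step analysis; the step and format names are illustrative only.
abstract_workflow = [
    AbstractStep("subset", "netcdf", "netcdf"),
    AbstractStep("average", "netcdf", "hdf5"),
    AbstractStep("plot", "csv", "png"),
]

plan = make_concrete(abstract_workflow, stored_format="netcdf", site="facility-A")
for step in plan:
    print(f"{step.kind:9s} {step.name} @ {step.site}")
```

Under these assumptions the user specifies only the three abstract steps, and the resulting plan contains one additional translation step, hdf5_to_csv, placed between average and plot.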

In general, data-side analysis facilities will not eliminate the need to replicate data.

One would expect that important data, which large communities share, will be

replicated to multiple sites, each providing its own data-side analysis facility.

For example, climate modeling data generated by long runs on supercomputers

will most likely be mirrored to multiple sites worldwide. Similarly, some

subsets of data will still be moved to scientists' own sites as the cost of

cluster hardware continues to fall and network speeds grow. As cloud computing

and storage grow in use, data-side analysis will likely be offered on cloud

facilities as well.

Scientific Database Management Systems

Historically, the concept of separating the logical organization of the data from

its physical organization dominated the development of database management

systems (DBMSs). This is referred to as “physical data independence”. The

logical organization describes “what” the structure of the data is, and the

physical organization describes “how” the data is stored and organized on

physical media, including memory and disk. In order to access the data faster,

different types of storage organization and indexes were invented, which did

not affect the logical organization of the data. Such concepts led to the

adoption of DBMSs in many areas, especially in business and commercial

application domains.
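
Physical data independence can be demonstrated with a small sketch using Python's built-in sqlite3 module. The table, column, and index names (reading, station, year, temp_c, idx_reading_station) and the sample values are hypothetical. The point is only that the same logical query returns the same rows before and after an index changes the physical organization.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Logical organization: a table whose rows are readings and whose columns are attributes.
conn.execute("CREATE TABLE reading (station TEXT, year INTEGER, temp_c REAL)")
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [("A", 2001, 14.2), ("B", 2001, 9.8), ("A", 2002, 14.6), ("B", 2002, 10.1)],
)

# The query states "what" is wanted and says nothing about "how" the data is stored.
query = "SELECT year, temp_c FROM reading WHERE station = ? ORDER BY year"

before = conn.execute(query, ("A",)).fetchall()
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("A",)).fetchall()

# Physical organization changes: add an index. The query text is left untouched.
conn.execute("CREATE INDEX idx_reading_station ON reading (station, year)")

after = conn.execute(query, ("A",)).fetchall()
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("A",)).fetchall()

print("same rows:", before == after)
# The last field of each plan row is SQLite's textual description of the access path.
print("plan before:", [row[-1] for row in plan_before])
print("plan after: ", [row[-1] for row in plan_after])
conn.close()
```

The two result lists are identical. What may differ is the access path that SQLite reports in its query plan, and the exact wording of that plan varies between SQLite versions.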

The dominant type of DBMS today is still the relational database system.

Its simple data model represents data as tables, where rows represent instances

of objects (such as people or books) and columns represent properties (or

attributes) of the objects, which made it very attractive to many applications.

However, by and large, relational database systems have not been used

extensively by scientific applications. Instead, most large scientific datasets

are stored as files in specific standard file formats, such as NetCDF and HDF5.

There are several reasons for this state of affairs. First, scientists want to

exchange data by simply sending each other files. Once some communities agreed

on a standard file format, and even included the metadata in the header of the

files, the files became “self-describing”. Second, there is
