Parallel Data Storage and Access - Scientific Data Management

2.4.4 High-Level I/O Libraries

Most file systems treat a file as a linear sequence of bytes. Applications are responsible for interpreting those bytes as logical structures, for instance a two-dimensional array of floating-point numbers. Without metadata describing the logical data structures, a program has difficulty telling what the bytes represent. Therefore, to ensure portability, a file's metadata must accompany the file at all times. This requirement is particularly important for scientific data because many scientific data libraries, such as those for visualization and data mining, manipulate data at a higher level than byte streams.
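
The point about metadata can be illustrated with a minimal sketch. It assumes a little-endian, row-major layout of 32-bit floats, and the helper names `encode` and `decode` are hypothetical, introduced only for this illustration. The same 24 bytes yield different logical arrays depending on the shape and element type the reader assumes.

```python
import struct

def encode(rows):
    """Flatten a 2-D list of floats into raw little-endian float32 bytes (row-major)."""
    flat = [v for row in rows for v in row]
    return struct.pack(f"<{len(flat)}f", *flat)

def decode(raw, shape, fmt="f"):
    """Rebuild a 2-D list from raw bytes, given a shape and a 4-byte element format."""
    nrows, ncols = shape
    flat = struct.unpack(f"<{nrows * ncols}{fmt}", raw)
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]

grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
raw = encode(grid)                    # 24 bytes carrying no self-description

as_2x3 = decode(raw, (2, 3))          # correct metadata: recovers the original grid
as_3x2 = decode(raw, (3, 2))          # wrong shape: same bytes, different array
as_ints = decode(raw, (2, 3), "i")    # wrong element type: same bytes read as int32
```

Only the first call reproduces `grid`. The other two succeed without error, which is why a self-describing file format stores the shape and type alongside the data.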

This section describes two popular scientific data libraries, parallel netCDF and HDF5. Both libraries store metadata along with data in the same files. In addition, both define their own file formats and a set of APIs to access the files, sequentially as well as in parallel.

2.4.4.1 Parallel netCDF

The network common data form (netCDF) was developed at the Unidata Program Center [17, 18]. The goal of netCDF is to define a portable file format so that scientists can share data across different machine platforms. Atmospheric science applications, for example, use netCDF to store a variety of data types that encompass single-point observations, time series, regularly spaced grids, and satellite or radar images [19]. Many organizations, including much of the climate community, rely on the netCDF data access standard for data storage [20]. However, netCDF does not provide adequate parallel I/O methods. To write in parallel to a shared netCDF file, applications must serialize access by passing all the data to a single process, which then writes to the file. This serial I/O access is both slow and cumbersome for the application programmer. A new set of parallel programming interfaces for netCDF files, parallel netCDF (PnetCDF), has therefore been developed [13].
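
The difference between the two access patterns can be sketched in plain Python. This is a single-process simulation under stated assumptions, not MPI code and not the PnetCDF API: the "ranks" are loop iterations, and `NPROCS`, `CHUNK`, `local_data`, `write_serialized`, and `write_by_offset` are hypothetical names. The first function funnels all data through one writer. The second lets each rank write its own slab at its own byte offset, which is the kind of access a parallel interface makes possible.

```python
import os
import struct
import tempfile

NPROCS = 4   # hypothetical number of processes
CHUNK = 3    # number of float64 values owned by each process

def local_data(rank):
    """Values held by one simulated process."""
    return [float(rank * CHUNK + i) for i in range(CHUNK)]

def write_serialized(path):
    """Funnel all data to a single writer, which then writes the whole file."""
    gathered = []
    for rank in range(NPROCS):            # stands in for a gather to one process
        gathered.extend(local_data(rank))
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(gathered)}d", *gathered))

def write_by_offset(path):
    """Each simulated process writes only its own slab at its own byte offset."""
    slab_bytes = CHUNK * 8
    with open(path, "wb") as f:
        f.truncate(NPROCS * slab_bytes)   # size the shared file up front
    for rank in reversed(range(NPROCS)):  # write order does not matter
        with open(path, "r+b") as f:
            f.seek(rank * slab_bytes)
            f.write(struct.pack(f"<{CHUNK}d", *local_data(rank)))

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "serialized.bin")
    path_b = os.path.join(tmp, "by_offset.bin")
    write_serialized(path_a)
    write_by_offset(path_b)
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        bytes_a, bytes_b = fa.read(), fb.read()

same = bytes_a == bytes_b   # both patterns produce the same file contents
```

Both functions produce identical files. What differs is that the offset-based version needs no single process to hold or write all of the data.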

The netCDF file format follows the common data form language (CDL), which is suitable for human readers interpreting the data. It divides a netCDF file into two parts: a file header and a body. The header contains all information about dimensions, attributes, and scalar variables; the body that follows contains the arrays of variable values in binary form. The header first defines a number of dimensions, each with a name and a length, which can be used to describe the shapes of arrays. The most significant dimension of a multidimensional array can be unlimited, for arrays of growing size. Global attributes not associated with any particular array can also be added to the header. This feature allows programmers' annotations and other related information to be added to increase the file's readability. The body of a netCDF file stores the fixed-size arrays first, followed by the variable-sized arrays. To store a variable-sized array, netCDF defines each subarray comprising all the fixed dimensions as a record, and the records of the different arrays are stored interleaved. All offsets of fixed-size and variable-size arrays are saved in the header.
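
The body layout described above can be made concrete with a small offset calculation. This is a simplified sketch, not the actual netCDF on-disk encoding: it takes the header size as a given number, ignores any alignment or padding a real file applies, and uses hypothetical variable names and sizes. It only shows how fixed-size arrays are laid out back to back and how the records of several variable-sized arrays interleave.

```python
from math import prod

def layout(header_size, fixed_vars, record_vars, num_records):
    """Compute byte offsets for a header / fixed arrays / interleaved records layout.

    fixed_vars:  list of (name, shape, element_size)
    record_vars: list of (name, shape_without_unlimited_dimension, element_size)
    Returns (fixed_offsets, record_offsets, record_stride).
    """
    fixed_offsets = {}
    pos = header_size
    for name, shape, esize in fixed_vars:
        fixed_offsets[name] = pos         # fixed-size arrays follow the header
        pos += prod(shape) * esize

    # One record of a variable is the subarray spanning all its fixed dimensions.
    record_sizes = [(name, prod(shape) * esize) for name, shape, esize in record_vars]
    record_stride = sum(size for _, size in record_sizes)

    record_offsets = {}
    within = 0                            # position of a variable inside one record group
    for name, size in record_sizes:
        record_offsets[name] = [pos + r * record_stride + within
                                for r in range(num_records)]
        within += size
    return fixed_offsets, record_offsets, record_stride

fixed, records, stride = layout(
    header_size=256,                                    # assumed, for illustration
    fixed_vars=[("lat", (4,), 4), ("lon", (8,), 4)],
    record_vars=[("temp", (4, 8), 4), ("pressure", (4, 8), 8)],
    num_records=3,
)
```

With these assumed sizes, `temp` records start at bytes 304, 688, and 1072, and `pressure` records at 432, 816, and 1200. Appending another record along the unlimited dimension extends the file by one stride of 384 bytes without moving any existing data.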
