Parallel Data Storage and Access - Scientific Data Management


and body parts. There are four essential classes of information in a header: name, data type, dataspace, and storage layout. The name of the array is a text string. The data type describes the numerical type of array elements, which can be atomic, native, compound, or named. Atomic data types are the primitive data types, such as integers and floats. Native data types are system-specific instances of atomic data types. Compound data types are collections of atomic data types. Named data types are either atomic or compound data types that can be shared across arrays. A dataspace describes the dimensionality of an array. Unlike netCDF, all dimensions of an HDF5 array can be either fixed or unlimited. The storage layout specifies the way a multidimensional array is stored in a file. The default storage layout is contiguous, meaning that data is stored in the file in the same linear order in which it is organized in memory. The other storage layout is called chunked, in which an array is divided into equal-sized chunks, and the chunks are stored separately in the file. Chunking has three important benefits. First, it makes it possible to achieve good performance when accessing noncontiguous subsets of an array. Second, it allows large arrays to be compressed. Third, it enables an array to be extended along any dimension. Chunking also applies to headers. Therefore, HDF5 headers can be dispersed in separate header blocks for each object and are not limited to the beginning of the file. Another important feature of HDF5 is called grouping. A collection of arrays can be grouped together, and a group may contain a number of arrays and other groups, organized in a tree-based hierarchical structure.
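The chunked layout described above can be illustrated with a small index calculation. The following is a minimal sketch in plain Python, not HDF5 API code: the function names (`chunk_grid`, `locate`, `chunks_touched`) and the 1000 x 1000 array with 100 x 100 chunks are assumptions chosen for illustration, and the real HDF5 library tracks chunks through its own internal index structures.

```python
from itertools import product
from math import ceil


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(n / c) for n, c in zip(shape, chunk_shape))


def locate(index, chunk_shape):
    """Map an element index to (chunk coordinates, offset inside that chunk)."""
    chunk = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk, offset


def chunks_touched(start, count, chunk_shape):
    """Chunk coordinates overlapped by the block [start, start + count).

    Assumes every entry of count is at least 1.
    """
    ranges = [
        range(s // c, (s + n - 1) // c + 1)
        for s, n, c in zip(start, count, chunk_shape)
    ]
    return list(product(*ranges))


# Hypothetical example: a 1000 x 1000 array stored in 100 x 100 chunks.
shape, chunk_shape = (1000, 1000), (100, 100)

print(chunk_grid(shape, chunk_shape))    # (10, 10)
print(locate((250, 731), chunk_shape))   # ((2, 7), (50, 31))

# A single column (a noncontiguous subset in row-major order) overlaps
# only the ten chunks stacked along the first dimension.
print(len(chunks_touched((0, 500), (1000, 1), chunk_shape)))  # 10

# Extending the second dimension only adds chunks to the grid;
# the existing chunks keep their coordinates.
print(chunk_grid((1000, 1500), chunk_shape))  # (10, 15)
```

Because each element's location depends only on its chunk coordinates and its offset within the chunk, growing the array along any dimension adds new chunks without relocating existing ones, which is one way to see why chunking permits extension in any direction.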

HDF5 APIs are divided into 12 categories: general purpose, attributes, datasets, error handling, file access, grouping, object identifiers, property list, references, dataspace, data type, and filter/compression. Writing an HDF5 file comprises the following steps: create a file, create groups, define dataspaces, define data types, create arrays, write attributes, write array data, and close the file. The parallel data access support in HDF5 is built on top of MPI-IO to ensure the file's portability. However, HDF5 does not separate its file access routines into collective and independent versions as MPI-IO does. Instead, parallel I/O is enabled by setting properties that are passed to the file open and data access APIs. These properties tell HDF5 whether to perform I/O collectively or independently. Similar to PnetCDF, HDF5 allows a subarray to be accessed in a single I/O call, which is achieved by defining hyperslabs in the dataspace. HDF5's chunking allows subarrays to be written without reorganizing them into a global canonical order in the file. The advantage of chunking becomes significant when the storage layout is orthogonal to the access layout, for example, when a two-dimensional array is stored in row-major order but accessed in column-major order. However, this high degree of flexibility in HDF5 can sometimes come at the cost of performance [13, 27].
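To make the hyperslab and orthogonal-access points concrete, the sketch below models a hyperslab with the (start, stride, count, block) parameters used by HDF5 hyperslab selection, then compares a column read against a row read on a row-major array. It is plain Python rather than HDF5 code: the helper names, the 1000 x 1000 array, and the 100 x 100 chunk shape are illustrative assumptions, and the counts of contiguous runs and chunks are only a rough proxy for I/O cost, not a performance measurement.

```python
from itertools import product


def hyperslab_indices(start, stride, count, block):
    """Enumerate the element indices selected by a hyperslab.

    Along each dimension, `count` blocks of `block` consecutive elements
    are selected, with block origins `stride` apart, beginning at `start`.
    Assumes block <= stride so that blocks do not overlap.
    """
    per_dim = [
        [s + k * st + j for k in range(c) for j in range(b)]
        for s, st, c, b in zip(start, stride, count, block)
    ]
    return list(product(*per_dim))


def contiguous_runs(indices, shape):
    """Separate contiguous runs needed in a row-major contiguous 2-D layout."""
    offsets = sorted(i * shape[1] + j for i, j in indices)
    if not offsets:
        return 0
    return 1 + sum(1 for a, b in zip(offsets, offsets[1:]) if b != a + 1)


def chunks_needed(indices, chunk_shape):
    """Distinct chunks that hold the selected elements of a 2-D array."""
    return len({(i // chunk_shape[0], j // chunk_shape[1]) for i, j in indices})


shape, chunk_shape = (1000, 1000), (100, 100)

# One column: access pattern orthogonal to the row-major storage order.
column = hyperslab_indices((0, 500), (1, 1), (1000, 1), (1, 1))
print(contiguous_runs(column, shape))      # 1000 separate one-element runs
print(chunks_needed(column, chunk_shape))  # 10 chunks

# One row: access pattern aligned with the storage order.
row = hyperslab_indices((500, 0), (1, 1), (1, 1000), (1, 1))
print(contiguous_runs(row, shape))         # 1 run
print(chunks_needed(row, chunk_shape))     # 10 chunks
```

In this model the chunked layout costs the same ten chunks for a row or a column, while the contiguous layout is ideal for rows and fragments into 1000 pieces for a column. The row case also hints at the trade-off noted above: when access already matches the storage order, chunking turns one contiguous read into several chunk reads.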
