Historically, data produced by supercomputer applications was simply stored as files
for subsequent analysis, sometimes days or weeks later. However, as the amount of
data becomes very large and/or the rates at which data is produced or consumed by
supercomputers become very high, this approach no longer works, and
high-throughput data movement techniques are needed.
Consequently, science-driven analytics over the next 20 years must support
high-throughput data movement methods that shield scientists from machine-
level details, such as the throughput achieved by a file system or the network
bandwidth available to move data from the supercomputer site to remote
machines on which the data is analyzed or visualized. Toward this end, we
advocate a new computing environment in which scientists can ask, “What if
I increase the pressure by a factor of 10?” and have the analytics software run
the appropriate methods to examine the effects of such a change without any
further work by the scientist. Since the simulations in which we are interested
run for long periods of time, we can imagine scientists doing in-situ visualization
during the lifetime of the run. The outcome of this approach is a paradigm
shift in which potentially plentiful computational resources (e.g., multicore
and accelerator technologies) are used in place of scarce I/O (Input/Output)
capabilities, for instance by coupling high-performance I/O with visualization
rather than adding further visualization routines to the simulation code itself.
Such “analytic I/O” efficiently moves data from the compute nodes to the
nodes where analysis and visualization are performed and/or to other nodes
where data is written to disk. Furthermore, the locations where analytics are
performed are flexible, with simple filtering or data reduction actions able
to run on compute nodes, data routing or reorganization performed on I/O
nodes, and more generally, with metadata generation (i.e., the generation
of information about data) performed where appropriate to match end-user
requirements. For instance, analytics may require that certain data be
identified and tagged on I/O nodes while it is being moved, so that it can be routed
to analysis or visualization machines. At the same time, for performance and
scalability, other data may be moved to disk in its raw form, to be reorganized
later into file organizations desired by end users. In all such cases, however,
high-throughput data movement is inextricably tied to data analysis, annotation,
and cataloging, thereby extracting the information required by end users
from the raw data.
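
To make this division of labor concrete, the short sketch below illustrates the idea
in plain Python. It is only an illustration under assumed names (Chunk,
compute_node_filter, io_node_tag, and route are invented here, and no particular I/O
middleware's API is implied): a cheap filter runs where the data is produced, metadata
tags are attached while the data is in flight, and the tags decide whether a chunk is
routed to analysis/visualization nodes or written to disk in raw form.

    # Hypothetical sketch of the "analytic I/O" placement described above.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Chunk:
        """A block of simulation output plus metadata accumulated in transit."""
        variable: str
        values: List[float]
        tags: Dict[str, str] = field(default_factory=dict)

    def compute_node_filter(chunk: Chunk, threshold: float) -> Chunk:
        """Cheap data reduction run where the data is produced (compute nodes)."""
        chunk.values = [v for v in chunk.values if abs(v) >= threshold]
        return chunk

    def io_node_tag(chunk: Chunk) -> Chunk:
        """Metadata generation while the data is being moved (I/O nodes):
        tag the variables that downstream analysis has asked for."""
        # The interest list is an assumption for this example.
        if chunk.variable in {"pressure", "temperature"}:
            chunk.tags["route"] = "analysis"
        else:
            chunk.tags["route"] = "disk"
        return chunk

    def route(chunk: Chunk) -> str:
        """Send tagged data to analysis/visualization; everything else goes
        to disk in raw form, to be reorganized later."""
        if chunk.tags.get("route") == "analysis":
            return f"-> analysis/visualization nodes: {chunk.variable} ({len(chunk.values)} values)"
        return f"-> raw file on disk: {chunk.variable} ({len(chunk.values)} values)"

    if __name__ == "__main__":
        outputs = [
            Chunk("pressure", [0.01, 3.2, -4.5, 0.002]),
            Chunk("debug_counters", [1.0, 2.0, 3.0]),
        ]
        for c in outputs:
            c = compute_node_filter(c, threshold=0.1)  # on compute nodes
            c = io_node_tag(c)                         # on I/O (staging) nodes
            print(route(c))                            # delivery decision

In a real deployment these stages would run on different sets of nodes and exchange
data over the network; the sketch only shows where the filtering, tagging, and routing
decisions are placed.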
In order to illustrate the high-throughput data requirements associated with
data-intensive computing, we describe next in some detail an example of a real,
large-scale fusion simulation. Fusion simulations are conducted in order to
model and understand the behavior of particles and electromagnetic waves in
tokamaks, which are devices designed to generate electricity from controlled
nuclear fusion that involves confining and heating a gaseous plasma by means of
an electric current and magnetic field. A few small devices, such as DIII-D and
NSTX, are already in operation, and a large device, ITER, is being built in
southern France.