Historically, data produced by supercomputer applications was simply stored as files
for subsequent analysis, sometimes days or weeks later. However, as the amount of
data becomes very large and/or the rates at which data is produced or consumed by
supercomputers become very high, this approach no longer works, and
high-throughput data movement techniques are needed.
Consequently, science-driven analytics over the next 20 years must support
high-throughput data movement methods that shield scientists from machine-
level details, such as the throughput achieved by a file system or the network
bandwidth available to move data from the supercomputer site to remote
machines on which the data is analyzed or visualized. Toward this end, we
advocate a new computing environment in which scientists can ask, “What if
I increase the pressure by a factor of 10?” and have the analytics software run
the appropriate methods to examine the effects of such a change without any
further work by the scientist. Since the simulations in which we are interested
run for long periods of time, we can imagine scientists doing in-situ visualization
during the lifetime of the run. The outcome of this approach is a paradigm
shift in which potentially plentiful computational resources (e.g., multicore
and accelerator technologies) are used in place of scarce I/O (Input/Output)
capabilities, for instance by coupling high-performance I/O with visualization
rather than adding further visualization routines to the simulation code itself.
Such “analytic I/O” efficiently moves data from the compute nodes to the
nodes where analysis and visualization are performed and/or to other nodes
where data is written to disk. Furthermore, the locations where analytics are
performed are flexible, with simple filtering or data reduction actions able
to run on compute nodes, data routing or reorganization performed on I/O
nodes, and more generally, with metadata generation (i.e., the generation
of information about data) performed where appropriate to match end-user
requirements. For instance, analytics may require that certain data be
identified and tagged on I/O nodes while it is being moved, so that it can be routed
to analysis or visualization machines. At the same time, for performance and
scalability, other data may be moved to disk in its raw form, to be reorganized
later into file organizations desired by end users. In all such cases, however,
high-throughput data movement is inextricably tied to data analysis, annotation,
and cataloging, thereby extracting the information required by end users
from the raw data.
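
To make this division of labor concrete, the short sketch below illustrates the idea
in plain Python. It is only an illustration under assumed names (Chunk,
compute_node_filter, io_node_tag, and route are invented here, and no particular I/O
middleware's API is implied): a cheap filter runs where the data is produced, metadata
tags are attached while the data is in flight, and the tags decide whether a chunk is
routed to analysis/visualization nodes or written to disk in raw form.

    # Hypothetical sketch of the "analytic I/O" placement described above.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Chunk:
        """A block of simulation output plus metadata accumulated in transit."""
        variable: str
        values: List[float]
        tags: Dict[str, str] = field(default_factory=dict)

    def compute_node_filter(chunk: Chunk, threshold: float) -> Chunk:
        """Cheap data reduction run where the data is produced (compute nodes)."""
        chunk.values = [v for v in chunk.values if abs(v) >= threshold]
        return chunk

    def io_node_tag(chunk: Chunk) -> Chunk:
        """Metadata generation while the data is being moved (I/O nodes):
        tag the variables that downstream analysis has asked for."""
        # The interest list is an assumption for this example.
        if chunk.variable in {"pressure", "temperature"}:
            chunk.tags["route"] = "analysis"
        else:
            chunk.tags["route"] = "disk"
        return chunk

    def route(chunk: Chunk) -> str:
        """Send tagged data to analysis/visualization; everything else goes
        to disk in raw form, to be reorganized later."""
        if chunk.tags.get("route") == "analysis":
            return f"-> analysis/visualization nodes: {chunk.variable} ({len(chunk.values)} values)"
        return f"-> raw file on disk: {chunk.variable} ({len(chunk.values)} values)"

    if __name__ == "__main__":
        outputs = [
            Chunk("pressure", [0.01, 3.2, -4.5, 0.002]),
            Chunk("debug_counters", [1.0, 2.0, 3.0]),
        ]
        for c in outputs:
            c = compute_node_filter(c, threshold=0.1)  # on compute nodes
            c = io_node_tag(c)                         # on I/O (staging) nodes
            print(route(c))                            # delivery decision

In a real deployment these stages would run on different sets of nodes and exchange
data over the network; the sketch only shows where the filtering, tagging, and routing
decisions are placed.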
In order to illustrate the high-throughput data requirements associated with
data-intensive computing, we describe next in some detail an example of a real,
large-scale fusion simulation. Fusion simulations are conducted in order to
model and understand the behavior of particles and electromagnetic waves in
tokamaks, which are devices designed to generate electricity from controlled
nuclear fusion that involves confining and heating a gaseous plasma by means of
an electric current and magnetic field. A few small devices, such as DIII-D and
NSTX, are already in operation, and a large device, ITER, is being built in
southern France.