1. The underlying I/O infrastructure only supports a partitioning scheme fixed by the simulation when the file(s) were created. Most of the time, this scenario corresponds to having one atomic chunk of data for each processor, atomic in the sense that partial reads of the chunk are not possible. Examples of I/O libraries of this type are Silo [6] and Exodus [7]. Other examples include file-per-processor output, which may be a good way to achieve I/O performance for the simulation's data-write phase, but which has undesirable consequences for the processing tools that read the data. Those tools are forced to reconcile the simulation's degree of parallelism with their own. For example, the simulation may decompose a three-dimensional space into 1,000 pieces, but the processing tool may be running with only five processors. In this case, the processing tool must find a way to partition the pieces across its processors, either by combining all of the pieces assigned to a given processor into one large piece or by respecting the piece layout and supporting multiple pieces per processor.
2. The underlying I/O infrastructure supports re-partitioning during read. Most of the time, this scenario corresponds to having all of the data in one large file, with the I/O infrastructure supporting operations like hyperslab reads, collective I/O, and so forth; a read of this style is sketched after this list. Examples of formats that can repartition data in this manner are ViSUS [8], SAF [9], and HDF5 [10].
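To make the second scenario concrete, here is a minimal sketch of a parallel hyperslab read, assuming a single HDF5 file holding a 3D dataset and an h5py installation built against parallel HDF5 (the mpio driver). The file path and dataset name are hypothetical placeholders, and the slab decomposition shown is just one simple choice.

    # Minimal sketch: each rank reads only its own slab of a shared file.
    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    # All ranks open the same file collectively (hypothetical path/name).
    with h5py.File("simulation.h5", "r", driver="mpio", comm=comm) as f:
        dset = f["density"]            # e.g., shape (nz, ny, nx)
        nz = dset.shape[0]

        # Partition the slowest-varying axis into roughly equal slabs.
        counts = [nz // nprocs + (1 if r < nz % nprocs else 0)
                  for r in range(nprocs)]
        start = sum(counts[:rank])
        stop = start + counts[rank]

        # Hyperslab read: only this rank's contiguous slab is pulled in.
        local_slab = dset[start:stop, :, :]

    print(f"rank {rank}: read slab of shape {local_slab.shape}")

The key property is that the partitioning is chosen by the reader at read time, independent of how many processors wrote the data.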
These two scenarios are well supported by the major parallel production visualization tools, although the tools handle parallel partitioning differently in each. In the first case (imposed partitioning), each subset of the partition normally consists of domains, where each domain consists of the portion operated on by a single processor; the visualization tool distributes these domains across its own processors (a simple assignment scheme is sketched below). In the second case (adaptive partitioning), the visualization tool forms its own partition of the dataset by having each processor read in a unique piece. In both cases, it is important that each processor reads an approximately equal amount of data, since the amount of data correlates strongly with the work to be performed in subsequent stages. From a scientific data management (SDM) perspective, the summary is that both ways of writing data are acceptable.
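For the imposed-partitioning case, a minimal sketch of distributing domains across processors follows; the contiguous-block assignment and the 1,000-domain, five-processor example are illustrative choices, not how any particular tool does it.

    # Minimal sketch: contiguous block assignment of domains to ranks.
    def assign_domains(num_domains: int, rank: int, nprocs: int) -> list[int]:
        """Give each rank roughly num_domains / nprocs domains."""
        base, extra = divmod(num_domains, nprocs)
        start = rank * base + min(rank, extra)
        count = base + (1 if rank < extra else 0)
        return list(range(start, start + count))

    # Example from the text: 1,000 simulation pieces over 5 processors.
    for rank in range(5):
        mine = assign_domains(1000, rank, 5)
        print(f"rank {rank}: domains {mine[0]}..{mine[-1]} ({len(mine)} total)")

Each rank here receives 200 domains; because assignment is by count rather than by data size, real tools often weight the assignment by per-domain data volume to keep the load balanced.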
9.2.1.2 Processing
The modern parallel visualization tools all use a data flow network processing design [11-13]. Data flow networks have base types of data objects and components (sometimes called process objects). The components can be filters, sources, or sinks. Filters have an input and an output, both of which are data objects. Sources have only data object outputs, while sinks have only data object inputs. A pipeline is an ordered collection of components. Each pipeline has a source (typically a file reader) followed by one or more filters.
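The structure described above can be summarized in a minimal sketch; the class and function names here are hypothetical, and production tools implement far richer versions of this design.

    # Minimal sketch of a data flow network: sources, filters, sinks.
    class Source:
        """Produces a data object; typically a file reader."""
        def __init__(self, produce):
            self.produce = produce
        def execute(self):
            return self.produce()

    class Filter:
        """Consumes one data object and produces another."""
        def __init__(self, transform):
            self.transform = transform
        def execute(self, data):
            return self.transform(data)

    class Sink:
        """Consumes a data object; e.g., a renderer or writer."""
        def __init__(self, consume):
            self.consume = consume
        def execute(self, data):
            self.consume(data)

    def run_pipeline(source, filters, sink):
        """A pipeline: a source, an ordered list of filters, then a sink."""
        data = source.execute()
        for f in filters:
            data = f.execute(data)
        sink.execute(data)

    # Toy usage: read numbers, keep positives, double them, print the result.
    run_pipeline(
        Source(lambda: [-2, 3, 5, -1]),
        [Filter(lambda d: [x for x in d if x > 0]),
         Filter(lambda d: [2 * x for x in d])],
        Sink(print),
    )

The value of this design is composability: any filter can be chained after any component whose output type it accepts, so new processing steps slot into existing pipelines without changing the other components.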