the entire data packet to the server. Also, to enable asynchronous communication, the data is buffered before the data transfer request is issued. We use PBIO [21] to marshal the data into a buffer reserved for DataTap usage. The use
of the buffer consumes some of the memory available to GTC but allows the
application to proceed without waiting for I/O. The application only blocks
for I/O while waiting for a previous I/O request to complete.
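The following sketch illustrates this client-side pattern, with a non-blocking MPI send standing in for the actual DataTap request and RDMA transport; the datatap_write helper and its buffer handling are illustrative rather than the real GTC/DataTap code.

/* Sketch of the buffered, asynchronous write pattern described above.
 * MPI_Isend stands in for the real transport (PBIO marshaling plus a
 * small transfer request that the server later services via RDMA). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void        *io_buf = NULL;               /* buffer reserved for DataTap use */
static MPI_Request  io_req = MPI_REQUEST_NULL;   /* previously issued I/O request   */

void datatap_write(const void *data, size_t size, int server_rank)
{
    /* Block only while a previous I/O request is still outstanding. */
    if (io_req != MPI_REQUEST_NULL)
        MPI_Wait(&io_req, MPI_STATUS_IGNORE);

    /* Marshal (here: copy) the data into the reserved buffer so the
     * application can continue computing without waiting for I/O. */
    io_buf = realloc(io_buf, size);
    memcpy(io_buf, data, size);

    /* Post the transfer without blocking; control returns immediately. */
    MPI_Isend(io_buf, (int)size, MPI_BYTE, server_rank, 0,
              MPI_COMM_WORLD, &io_req);
}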
Once the DataTap server receives the request, it is queued up locally for
future processing. Queuing is necessary because of the large imbalance between the total size of the data to be transferred and the amount of memory available on the service node. For each request, the DataTap server
issues an RDMA read request to the originating compute node.
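A simple queue such as the following is enough to hold the incoming transfer requests until memory becomes available to service them; the structure and field names are illustrative, not the actual DataTap data structures.

/* Sketch of the server-side request queue described above. */
#include <stddef.h>

typedef struct dt_pending {
    int                source_rank;   /* compute node that owns the data        */
    size_t             size;          /* number of bytes to pull via RDMA read  */
    void              *remote_addr;   /* remote buffer address from the request */
    struct dt_pending *next;
} dt_pending;

/* Requests are queued on arrival because the aggregate data size far
 * exceeds the memory available on the service node. */
static dt_pending *queue_head = NULL, *queue_tail = NULL;

void enqueue_request(dt_pending *req)
{
    req->next = NULL;
    if (queue_tail) queue_tail->next = req;
    else            queue_head = req;
    queue_tail = req;
}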
To maximize the bandwidth usage for the application, the DataTap server
issues multiple RDMA read requests concurrently. The number of outstanding requests is determined by the available memory on the service nodes and the size of the data being transferred. Also, to minimize the perturbation caused by asynchronous I/O, the DataTap server uses a scheduling mechanism so as not to
issue read requests when the application is actively using the network fabric.
Once the data buffer has been transferred, the DataTap server forwards it to the I/O graph for further processing.
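Building on the queue sketch above, the following illustrates one way such memory-bounded scheduling could look; the memory budget, rdma_read_post, and io_graph_submit calls are assumptions standing in for the actual transport and I/O-graph interfaces, and the check that defers reads while the application is actively using the network is omitted for brevity.

/* Sketch of memory-bounded issue of concurrent RDMA reads; the extern
 * functions are hypothetical stand-ins, not the real DataTap API. */
#include <stdlib.h>

#define MEMORY_BUDGET (256UL * 1024 * 1024)  /* assumed service-node buffer budget */
static size_t in_flight = 0;                 /* bytes of posted, uncompleted reads */

extern void rdma_read_post(dt_pending *req, void *local_buf);  /* async RDMA pull       */
extern void io_graph_submit(void *buf, size_t size);           /* hand off to I/O graph */

/* Issue as many reads as the budget allows; the rest remain queued. */
void schedule_reads(void)
{
    while (queue_head && in_flight + queue_head->size <= MEMORY_BUDGET) {
        dt_pending *req = queue_head;
        queue_head = req->next;
        if (!queue_head) queue_tail = NULL;

        void *buf = malloc(req->size);
        in_flight += req->size;
        rdma_read_post(req, buf);            /* pull data from the compute node */
    }
}

/* Completion handler: release budget, forward the buffer, refill the pipeline. */
void on_read_complete(dt_pending *req, void *buf)
{
    in_flight -= req->size;
    io_graph_submit(buf, req->size);
    free(req);
    schedule_reads();
}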
DataTap Evaluation To evaluate the efficiency and performance of the
DataTap we look at the bandwidth observed at the DataTap server (at the
I/O node). In Figure 5.2 we evaluate the scalability of our two DataTap implementations by looking at the maximum bandwidth achieved during data
transfers. The InfiniBand DataTap (on a Linux Cluster) suffers a performance
degradation due to the lack of a reliable datagram transport in our current
hardware. However, this performance penalty only affects the first iteration
of the data transfer, where connection initiation is performed. Subsequent
transfers use cached connection information for improved performance. For
smaller data sizes, the Cray XT3 DataTap is significantly faster than the InfiniBand DataTap. However, the InfiniBand DataTap achieves a higher maximum bandwidth thanks to its more optimized memory handling; we are currently addressing this in the Cray XT3 version.
In GTC's default I/O pattern, the dominant cost is from each processor's
writing out the local array of particles into a separate file. This corresponds
to writing out something close to 10% of the memory footprint of the code,
with the write frequency chosen so as to keep the average overhead of I/O
within a reasonable percentage of total execution time. As part of the standard
process of accumulating and interpreting this data, these individual files are
then aggregated and parsed into time series, spatially bounded regions, and
so forth, depending on downstream needs.
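The per-process output pattern itself is straightforward; a sketch of what each rank does is shown below, where the file naming and particle layout are illustrative rather than GTC's actual format.

/* Sketch of GTC's default per-process output: every MPI rank writes its
 * local particle array to its own file. */
#include <mpi.h>
#include <stdio.h>

void write_particles(const double *particles, size_t count, int step)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char name[64];
    snprintf(name, sizeof name, "particles_step%04d_rank%05d.dat", step, rank);

    FILE *f = fopen(name, "wb");
    if (f) {
        fwrite(particles, sizeof(double), count, f);
        fclose(f);
    }
}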
To demonstrate the utility of structured streams in an application environment, we evaluated GTC on a Cray XT3 development cluster at ORNL with two different input set sizes. For each, we compared GTC's runtime for three different I/O configurations: no data output, data output to a per-MPI-process Lustre file, and data output using a DataTap (Table 5.1). We observed