the entire data packet to the server. Also, to enable asynchronous communication, the data is buffered before the data transfer request is issued. We use PBIO [21] to marshal the data into a buffer reserved for DataTap usage. The use
of the buffer consumes some of the memory available to GTC but allows the
application to proceed without waiting for I/O. The application only blocks
for I/O while waiting for a previous I/O request to complete.
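The following sketch illustrates this client-side pattern, with a non-blocking MPI send standing in for the actual DataTap request and RDMA transport; the datatap_write helper and its buffer handling are illustrative rather than the real GTC/DataTap code.

/* Sketch of the buffered, asynchronous write pattern described above.
 * MPI_Isend stands in for the real transport (PBIO marshaling plus a
 * small transfer request that the server later services via RDMA). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void        *io_buf = NULL;               /* buffer reserved for DataTap use */
static MPI_Request  io_req = MPI_REQUEST_NULL;   /* previously issued I/O request   */

void datatap_write(const void *data, size_t size, int server_rank)
{
    /* Block only while a previous I/O request is still outstanding. */
    if (io_req != MPI_REQUEST_NULL)
        MPI_Wait(&io_req, MPI_STATUS_IGNORE);

    /* Marshal (here: copy) the data into the reserved buffer so the
     * application can continue computing without waiting for I/O. */
    io_buf = realloc(io_buf, size);
    memcpy(io_buf, data, size);

    /* Post the transfer without blocking; control returns immediately. */
    MPI_Isend(io_buf, (int)size, MPI_BYTE, server_rank, 0,
              MPI_COMM_WORLD, &io_req);
}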
Once the DataTap server receives the request, it is queued up locally for
future processing. Queuing is necessary because of the large imbalance between the total size of the data to be transferred and the amount of memory available on the service node. For each request, the DataTap server
issues an RDMA read request to the originating compute node.
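A simple queue such as the following is enough to hold the incoming transfer requests until memory becomes available to service them; the structure and field names are illustrative, not the actual DataTap data structures.

/* Sketch of the server-side request queue described above. */
#include <stddef.h>

typedef struct dt_pending {
    int                source_rank;   /* compute node that owns the data        */
    size_t             size;          /* number of bytes to pull via RDMA read  */
    void              *remote_addr;   /* remote buffer address from the request */
    struct dt_pending *next;
} dt_pending;

/* Requests are queued on arrival because the aggregate data size far
 * exceeds the memory available on the service node. */
static dt_pending *queue_head = NULL, *queue_tail = NULL;

void enqueue_request(dt_pending *req)
{
    req->next = NULL;
    if (queue_tail) queue_tail->next = req;
    else            queue_head = req;
    queue_tail = req;
}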
To maximize the bandwidth usage for the application, the DataTap server
issues multiple RDMA read requests concurrently. The number of outstanding requests is determined by the available memory on the service nodes and the size of the data being transferred. Also, to minimize the perturbation caused by asynchronous I/O, the DataTap server uses a scheduling mechanism so as not to
issue read requests when the application is actively using the network fabric.
Once the data buffer has been transferred, the DataTap server forwards it to the I/O graph for further processing.
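Building on the queue sketch above, the following illustrates one way such memory-bounded scheduling could look; the memory budget, rdma_read_post, and io_graph_submit calls are assumptions standing in for the actual transport and I/O-graph interfaces, and the check that defers reads while the application is actively using the network is omitted for brevity.

/* Sketch of memory-bounded issue of concurrent RDMA reads; the extern
 * functions are hypothetical stand-ins, not the real DataTap API. */
#include <stdlib.h>

#define MEMORY_BUDGET (256UL * 1024 * 1024)  /* assumed service-node buffer budget */
static size_t in_flight = 0;                 /* bytes of posted, uncompleted reads */

extern void rdma_read_post(dt_pending *req, void *local_buf);  /* async RDMA pull       */
extern void io_graph_submit(void *buf, size_t size);           /* hand off to I/O graph */

/* Issue as many reads as the budget allows; the rest remain queued. */
void schedule_reads(void)
{
    while (queue_head && in_flight + queue_head->size <= MEMORY_BUDGET) {
        dt_pending *req = queue_head;
        queue_head = req->next;
        if (!queue_head) queue_tail = NULL;

        void *buf = malloc(req->size);
        in_flight += req->size;
        rdma_read_post(req, buf);            /* pull data from the compute node */
    }
}

/* Completion handler: release budget, forward the buffer, refill the pipeline. */
void on_read_complete(dt_pending *req, void *buf)
{
    in_flight -= req->size;
    io_graph_submit(buf, req->size);
    free(req);
    schedule_reads();
}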
DataTap Evaluation To evaluate the efficiency and performance of the
DataTap we look at the bandwidth observed at the DataTap server (at the
I/O node). In Figure 5.2 we evaluate the scalability of our two DataTap implementations by looking at the maximum bandwidth achieved during data
transfers. The InfiniBand DataTap (on a Linux Cluster) suffers a performance
degradation due to the lack of a reliable datagram transport in our current
hardware. However, this performance penalty only affects the first iteration
of the data transfer, where connection initiation is performed. Subsequent
transfers use cached connection information for improved performance. For
smaller data sizes, the Cray XT3 DataTap is significantly faster than the InfiniBand DataTap. However, the InfiniBand DataTap achieves a higher maximum bandwidth thanks to its more optimized memory handling; we are currently addressing this in the Cray XT3 version.
In GTC's default I/O pattern, the dominant cost is from each processor's
writing out the local array of particles into a separate file. This corresponds
to writing out something close to 10% of the memory footprint of the code,
with the write frequency chosen so as to keep the average overhead of I/O
within a reasonable percentage of total execution time. As part of the standard
process of accumulating and interpreting this data, these individual files are
then aggregated and parsed into time series, spatially bounded regions, and
so forth, depending on downstream needs.
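The per-process output pattern itself is straightforward; a sketch of what each rank does is shown below, where the file naming and particle layout are illustrative rather than GTC's actual format.

/* Sketch of GTC's default per-process output: every MPI rank writes its
 * local particle array to its own file. */
#include <mpi.h>
#include <stdio.h>

void write_particles(const double *particles, size_t count, int step)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char name[64];
    snprintf(name, sizeof name, "particles_step%04d_rank%05d.dat", step, rank);

    FILE *f = fopen(name, "wb");
    if (f) {
        fwrite(particles, sizeof(double), count, f);
        fclose(f);
    }
}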
To demonstrate the utility of structured streams in an application environment, we evaluated GTC on a Cray XT3 development cluster at ORNL with two different input set sizes. For each, we compared GTC's runtime for three different I/O configurations: no data output, data output to a per-MPI-process Lustre file, and data output using a DataTap (Table 5.1). We observed