(1) Interconnects in high-end computing systems are becoming faster and faster. For example, the latest Cray Gemini interconnect can sustain up to 20 GB/s [1].
(2) The issue with doing the MPI collective operation in MPI-IO is not the
sheer volume of data to exchange. Instead, the dominating factor that slows
down application performance is the frequency of collective operations and
the possibility of lock contention. Earlier work [4] shows that MPI_Bcast is called
314,800 times in the Chimera run, which takes 25% of the wall clock time.
(3) The collective operation in ADIOS is done in a very controlled manner.
All MPI processors are split into sub-groups, and aggregation is done within a
sub-communicator. Therefore, the interference between groups is minimized.
Meanwhile, indexes are also generated first within a group and then sent by
all the aggregator processors to a root processor (e.g., rank 0) to avoid global
collectives. (4) Most of today's computing resources, such as the Jaguar Cray XT5,
use multicore CPUs, and aggregation among the cores within a single chip is
inexpensive, as the cost is close to that of a memcpy() operation.
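A minimal sketch of this two-level grouping is shown below. It is not the ADIOS implementation: the fixed GROUP_SIZE, the use of MPI_Gather inside each sub-communicator, and the single long value standing in for the per-group index are all assumptions made purely for illustration.

#include <mpi.h>
#include <stdlib.h>

#define GROUP_SIZE 16          /* hypothetical sub-group size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Split all processors into sub-groups; the lowest rank in each
       group (group_rank 0) acts as the aggregator. */
    int group = rank / GROUP_SIZE;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int group_rank, group_size;
    MPI_Comm_rank(group_comm, &group_rank);
    MPI_Comm_size(group_comm, &group_size);

    /* Stage 1: aggregation stays inside the sub-communicator,
       so groups do not interfere with each other. */
    double local = (double) rank;               /* stand-in for a buffered PG */
    double *gathered = NULL;
    if (group_rank == 0)
        gathered = malloc(group_size * sizeof(double));
    MPI_Gather(&local, 1, MPI_DOUBLE, gathered, 1, MPI_DOUBLE, 0, group_comm);

    /* Stage 2: only the aggregators send their per-group index to rank 0,
       avoiding a collective over all processors. */
    long group_index = group_size;              /* stand-in for real index metadata */
    if (group_rank == 0 && rank != 0) {
        MPI_Send(&group_index, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int num_groups = (nprocs + GROUP_SIZE - 1) / GROUP_SIZE;
        for (int g = 1; g < num_groups; g++) {
            long remote;
            MPI_Recv(&remote, 1, MPI_LONG, g * GROUP_SIZE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    free(gathered);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}

Because each gather involves only one sub-group, and only one small message per group reaches rank 0, no single operation has to span the entire machine.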
Listing 22.1: Example ADIOS code.
adios_open (&adios_handle, "analysis", filename, "w", &comm);
adios_write (adios_handle, "NX", &NX);
adios_write (adios_handle, "NY", &NY);
adios_write (adios_handle, "temperature", t);
adios_close (adios_handle);
Within a group, an aggregator gathers the buffered PGs from all of its
members, provided that there is sufficient memory on the aggregator proces-
sor. Depending on the communication pattern, an aggregator can either per-
form all-to-one communication (i.e., MPI_Gather()) or brigade-like communi-
cation (see Figure 22.1). In the former case, all members send data addressed
directly to the aggregator processor. In the latter case, a member sends its
data to its upstream processor. As a result, while an aggregator writes data
from one member, data from another member will be moved closer. Therefore,
the communication cost can be minimized. The idea of brigade aggregation
is to overlap MPI communication with disk I/O and achieve streaming-like
I/O. Next, each aggregator writes out all data that it receives to a subfile.
The subfile is striped on a single OST (Object Storage Target) to minimize
the potential write lock contention between aggregators. A global metadata
file is also written out from P0 to make reading the data in the subfiles possible.
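The brigade step can be sketched as follows. This is only an illustrative skeleton, not the ADIOS code: write_pg(), the uniform pg_size, and the double buffering on the aggregator are assumptions, and group_comm is the sub-communicator from the grouping described earlier. The aggregator posts a nonblocking receive for the next PG before writing the current one, while each member relays data one hop upstream.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: append one PG to the aggregator's subfile. */
static void write_pg(FILE *subfile, const char *pg, int pg_size)
{
    fwrite(pg, 1, (size_t) pg_size, subfile);
}

/* Brigade aggregation within one sub-communicator.  Every processor owns one
   PG of pg_size bytes in pg_buf; recv_buf is a scratch buffer of equal size. */
void brigade_aggregate(MPI_Comm group_comm, char *pg_buf, char *recv_buf,
                       int pg_size, FILE *subfile)
{
    int r, n;
    MPI_Comm_rank(group_comm, &r);
    MPI_Comm_size(group_comm, &n);

    if (r == 0) {                                  /* aggregator */
        char *cur = pg_buf, *next = recv_buf;
        MPI_Request req;
        if (n > 1)                                 /* prefetch the first relayed PG */
            MPI_Irecv(next, pg_size, MPI_BYTE, 1, 0, group_comm, &req);
        write_pg(subfile, cur, pg_size);           /* write own PG while data moves */
        for (int i = 1; i < n; i++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            char *tmp = cur; cur = next; next = tmp;   /* double buffering */
            if (i + 1 < n)                         /* start receiving PG i+1 ... */
                MPI_Irecv(next, pg_size, MPI_BYTE, 1, 0, group_comm, &req);
            write_pg(subfile, cur, pg_size);       /* ... while writing PG i */
        }
    } else {                                       /* member: send own PG, then relay */
        MPI_Send(pg_buf, pg_size, MPI_BYTE, r - 1, 0, group_comm);
        for (int i = r + 1; i < n; i++) {
            MPI_Recv(recv_buf, pg_size, MPI_BYTE, r + 1, 0, group_comm,
                     MPI_STATUS_IGNORE);
            MPI_Send(recv_buf, pg_size, MPI_BYTE, r - 1, 0, group_comm);
        }
    }
}

In practice ADIOS must also handle PGs of differing sizes and the limited memory on the aggregator processor, which this sketch ignores.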
One challenge of the one-file-per-process (N-N) pattern is the overwhelming meta-
data pressure resulting from the simultaneous creation of tens of thousands
of files. The single shared file (N-1) pattern, on the other hand, often results in
unaligned accesses across file system boundaries, which in turn cause write lock
contention among processors. The aggregation scheme offers a flexible N-M
pattern that overcomes the drawbacks of both N-N and N-1 through a reduced
number of files and larger writes.
 