(1) Interconnects in high-end computing systems are becoming faster and faster. For example, the latest Cray Gemini interconnect can sustain up to 20 GB/s [1].
(2) The issue with doing the MPI collective operation in MPI-IO is not the
sheer volume of data to exchange. Instead, the dominating factor that slows
down application performance is the frequency of collective operations and
the possibility of lock contention. Earlier work [4] shows that MPI_Bcast is called
314,800 times in the Chimera run, which takes 25% of the wall clock time.
(3) The collective operation in ADIOS is done in a very controlled manner.
All MPI processors are split into sub-groups, and aggregation is done within a
sub-communicator. Therefore, the interference between groups is minimized.
Meanwhile, indexes are also generated first within a group and then sent by
all the aggregator processors to a root processor (e.g., rank 0) to avoid global
collectives. (4) Most of today's computing resources, such as the Jaguar Cray XT5,
use multicore CPUs, and aggregation among the cores within a single chip is
inexpensive, as the cost is close to that of a memcpy() operation.
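A minimal sketch of this two-level grouping is shown below. It is not the ADIOS implementation: the fixed GROUP_SIZE, the use of MPI_Gather inside each sub-communicator, and the single long value standing in for the per-group index are all assumptions made purely for illustration.

#include <mpi.h>
#include <stdlib.h>

#define GROUP_SIZE 16          /* hypothetical sub-group size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Split all processors into sub-groups; the lowest rank in each
       group (group_rank 0) acts as the aggregator. */
    int group = rank / GROUP_SIZE;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int group_rank, group_size;
    MPI_Comm_rank(group_comm, &group_rank);
    MPI_Comm_size(group_comm, &group_size);

    /* Stage 1: aggregation stays inside the sub-communicator,
       so groups do not interfere with each other. */
    double local = (double) rank;               /* stand-in for a buffered PG */
    double *gathered = NULL;
    if (group_rank == 0)
        gathered = malloc(group_size * sizeof(double));
    MPI_Gather(&local, 1, MPI_DOUBLE, gathered, 1, MPI_DOUBLE, 0, group_comm);

    /* Stage 2: only the aggregators send their per-group index to rank 0,
       avoiding a collective over all processors. */
    long group_index = group_size;              /* stand-in for real index metadata */
    if (group_rank == 0 && rank != 0) {
        MPI_Send(&group_index, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int num_groups = (nprocs + GROUP_SIZE - 1) / GROUP_SIZE;
        for (int g = 1; g < num_groups; g++) {
            long remote;
            MPI_Recv(&remote, 1, MPI_LONG, g * GROUP_SIZE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    free(gathered);
    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}

Because each gather involves only one sub-group, and only one small message per group reaches rank 0, no single operation has to span the entire machine.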
Listing 22.1: Example ADIOS code.
adios_open (&adios_handle, "analysis", filename, "w", &comm);
adios_write (adios_handle, "NX", &NX);
adios_write (adios_handle, "NY", &NY);
adios_write (adios_handle, "temperature", t);
adios_close (adios_handle);
Within a group, an aggregator gathers the buffered PGs from all of its
members, provided that there is sufficient memory on the aggregator proces-
sor. Depending on the communication pattern, an aggregator can either per-
form all-to-one communication (i.e., MPI_Gather()) or brigade-like communi-
cation (see Figure 22.1). In the former case, all members send data addressed
directly to the aggregator processor. In the latter case, a member sends its
data to its upstream processor. As a result, while an aggregator writes data
from one member, data from another member will be moved closer. Therefore,
the communication cost can be minimized. The idea of brigade aggregation
is to overlap MPI communication with disk I/O and achieve streaming-like
I/O. Next, each aggregator writes out all data that it receives to a subfile.
The subfile is striped on a single OST (Object Storage Target) to minimize
the potential write lock contention between aggregators. A global metadata
file is also written out from P0 to make reading the data in the subfiles possible.
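The brigade step can be sketched as follows. This is only an illustrative skeleton, not the ADIOS code: write_pg(), the uniform pg_size, and the double buffering on the aggregator are assumptions, and group_comm is the sub-communicator from the grouping described earlier. The aggregator posts a nonblocking receive for the next PG before writing the current one, while each member relays data one hop upstream.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: append one PG to the aggregator's subfile. */
static void write_pg(FILE *subfile, const char *pg, int pg_size)
{
    fwrite(pg, 1, (size_t) pg_size, subfile);
}

/* Brigade aggregation within one sub-communicator.  Every processor owns one
   PG of pg_size bytes in pg_buf; recv_buf is a scratch buffer of equal size. */
void brigade_aggregate(MPI_Comm group_comm, char *pg_buf, char *recv_buf,
                       int pg_size, FILE *subfile)
{
    int r, n;
    MPI_Comm_rank(group_comm, &r);
    MPI_Comm_size(group_comm, &n);

    if (r == 0) {                                  /* aggregator */
        char *cur = pg_buf, *next = recv_buf;
        MPI_Request req;
        if (n > 1)                                 /* prefetch the first relayed PG */
            MPI_Irecv(next, pg_size, MPI_BYTE, 1, 0, group_comm, &req);
        write_pg(subfile, cur, pg_size);           /* write own PG while data moves */
        for (int i = 1; i < n; i++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            char *tmp = cur; cur = next; next = tmp;   /* double buffering */
            if (i + 1 < n)                         /* start receiving PG i+1 ... */
                MPI_Irecv(next, pg_size, MPI_BYTE, 1, 0, group_comm, &req);
            write_pg(subfile, cur, pg_size);       /* ... while writing PG i */
        }
    } else {                                       /* member: send own PG, then relay */
        MPI_Send(pg_buf, pg_size, MPI_BYTE, r - 1, 0, group_comm);
        for (int i = r + 1; i < n; i++) {
            MPI_Recv(recv_buf, pg_size, MPI_BYTE, r + 1, 0, group_comm,
                     MPI_STATUS_IGNORE);
            MPI_Send(recv_buf, pg_size, MPI_BYTE, r - 1, 0, group_comm);
        }
    }
}

In practice ADIOS must also handle PGs of differing sizes and the limited memory on the aggregator processor, which this sketch ignores.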
One challenge of the one-file-per-process (N-N) pattern is the overwhelming meta-
data pressure resulting from the simultaneous creation of tens of thousands
of files. The single shared file (N-1) pattern, on the other hand, often results in
unaligned accesses across file system boundaries, which in turn cause write lock
contention among processors. The aggregation scheme offers a flexible N-M
pattern that overcomes the drawbacks of both N-N and N-1 through a reduced
number of files and larger writes.
 