[Figure 22.2: line plot of write throughput (0–20 GB/s) versus number of processors (100 to 100,000, log scale) for four I/O configurations: aggregation ratio 50:1, aggregation ratio 10:1, no aggregation N:N (POSIX), and no aggregation N:1 (MPI_LUSTRE).]

FIGURE 22.2: S3D write performance scaling as application size is increased. Note the change in best performance from 10:1 aggregation to 50:1 aggregation as application size hits 96,000 cores.
To evaluate the staged write performance, we ran S3D on JaguarPF in a weak-scaling configuration, from a 96-core run up to a 96,000-core run (see Figure 22.2). Each data point in the figure is the average of the results collected over ten consecutive runs. N:N and N:1 without aggregation achieve only a few GB/s across all runs. POSIX performs quite well for the smallest 96-core run, but its write speed decreases as the run scales up. MPI_LUSTRE is an I/O method in ADIOS that writes out a single file aligned with the Lustre stripe. Its write performance relative to POSIX improves slightly as the number of cores increases; however, at the 96,000-core run its performance drops due to heavy contention. With 10:1 aggregation, the speed is much improved over the no-aggregation schemes, but it also drops significantly at the 96,000-core run. The reason is that, at a 10:1 ratio, the number of subfiles generated by the 96,000-core job overwhelms the metadata server (as with POSIX), and file open/close takes much longer to finish. We subsequently increased the ratio to 50:1. Although 50:1 performs worse than 10:1 at low core counts (the smaller number of aggregators does not fully utilize all the OSTs), at the 96,000-core run it yields much higher throughput than the 10:1 ratio.
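The subfile arithmetic behind this tradeoff can be sketched as follows (an illustrative calculation only; the `subfiles` helper is hypothetical and not part of ADIOS, and the core count and ratios are taken from the runs described above):

```python
# Number of subfiles produced for a 96,000-core S3D run under
# different aggregation ratios. Each group of `ratio` writer
# processes shares one aggregator, and each aggregator writes
# one subfile.
cores = 96_000

def subfiles(cores: int, ratio: int) -> int:
    """Subfiles written when `ratio` writers share one aggregator."""
    return cores // ratio

print(subfiles(cores, 1))   # no aggregation (N:N POSIX): 96,000 files
print(subfiles(cores, 10))  # 10:1 ratio: 9,600 subfiles, still heavy metadata load
print(subfiles(cores, 50))  # 50:1 ratio: 1,920 subfiles, far fewer opens/closes
```

At 50:1 the metadata server handles a small fraction of the file opens required at 10:1, which is consistent with the higher throughput observed at 96,000 cores, at the cost of fewer concurrent writers to spread across the OSTs.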
 