22.2.5 Index Generation
An aggregator in each group is responsible for generating the local indexes for
each subfile, and the aggregator in the first group (P0) also generates the indexes
in the metadata file. This is done by performing two rounds of collective
communication (using a sub-communicator) among all processes in order to
improve scalability. First, every rank builds its indexes locally and then sends its
portion of the indexes to its aggregator. Once the aggregator receives the data,
it populates the data structures for the process groups (PG), variables, and
attributes. These are used to create the footer for the local file. All aggregators
then send their indexes to the highest-level aggregator (P0), which writes out
a combined global metadata file for the entire cohort.
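The two-round aggregation can be pictured as a pair of gathers over MPI communicators. The sketch below is a simplified illustration under assumed conventions, not the actual ADIOS code: each rank's index is treated as a raw byte buffer, group_comm stands for one group's sub-communicator with rank 0 as that group's aggregator, and the second round over an aggregator-only communicator is only indicated in a comment.

/* Minimal sketch of the two-round index aggregation (hypothetical names;
 * the real implementation differs in detail). */
#include <mpi.h>
#include <stdlib.h>

void aggregate_indexes(MPI_Comm group_comm, const char *local_index, int local_len)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    /* Round 1: every rank sends its index size, then its index bytes,
     * to the group aggregator (rank 0 of the sub-communicator). */
    int *lens = NULL, *displs = NULL, total = 0;
    char *group_index = NULL;
    if (grank == 0) {
        lens   = malloc(gsize * sizeof(int));
        displs = malloc(gsize * sizeof(int));
    }
    MPI_Gather(&local_len, 1, MPI_INT, lens, 1, MPI_INT, 0, group_comm);
    if (grank == 0) {
        for (int i = 0; i < gsize; i++) { displs[i] = total; total += lens[i]; }
        group_index = malloc(total);
    }
    MPI_Gatherv((void *)local_index, local_len, MPI_BYTE,
                group_index, lens, displs, MPI_BYTE, 0, group_comm);

    /* The aggregator now populates the PG, variable, and attribute structures
     * from group_index and writes the footer of its subfile.  Round 2 repeats
     * the same gather pattern over an aggregator-only communicator, rooted at
     * P0, which then writes the combined global metadata file (omitted). */

    if (grank == 0) { free(lens); free(displs); free(group_index); }
}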
22.2.6 Staged Read Method
The subfiling technique can greatly improve write performance; however, it
poses new challenges for reading data. Without carefully managing metadata
operations such as MPI_File_open() or POSIX open(), subsetting data can
be very expensive, particularly for operations such as planar access. Suppose
there are M MPI processes and N subfiles on disk, and each MPI rank needs
to read in an arbitrary plane from a multi-dimensional array, which can be
scattered across all subfiles. There will be M * N file opens simultaneously
issued to the Lustre metadata server, which can be overwhelming. Figure 22.3
shows the metadata operation cost that measures the open() and close() time.
It is clear that as the run is scaled up, the metadata cost increases quickly.
Consider a typical S3D post-processing run: a 2,400-core post-processing job
reads in data across the subfiles dumped out by a 96,000-core computation job.
Depending on the access pattern, it may issue 4 million file opens. Assuming
each file open takes 0.5 ms, the metadata operations alone will take around
40 minutes. The staged read method tackles this issue by performing
staged file operations, in which only selected MPI processes open and close files.
I/O chunking is another technique used here to achieve bulk reads, similar to
MPI-IO, and details are discussed next.
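As a rough illustration of staged file operations combined with I/O chunking, the sketch below assumes one designated reader per group: only that rank opens the subfile and issues a single bulk read, and the data is then scattered to the other ranks over the sub-communicator. The function name, the fixed per-rank chunk size, and the data layout are hypothetical and are not part of the ADIOS read API.

/* Hypothetical staged-read helper: only rank 0 of group_comm opens the
 * subfile (so M*N opens shrink to one open per group), reads one large
 * chunk, and scatters equal-sized pieces to the other ranks. */
#include <mpi.h>
#include <stdlib.h>

void staged_read(MPI_Comm group_comm, const char *subfile,
                 MPI_Offset offset, int bytes_per_rank, char *my_buf)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    char *chunk = NULL;
    if (grank == 0) {
        /* Only the group's reader contacts the Lustre metadata server. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, subfile, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        chunk = malloc((size_t)gsize * bytes_per_rank);
        /* One bulk read instead of many small reads (I/O chunking). */
        MPI_File_read_at(fh, offset, chunk, gsize * bytes_per_rank,
                         MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    /* Hand each rank its portion of the bulk read. */
    MPI_Scatter(chunk, bytes_per_rank, MPI_BYTE,
                my_buf, bytes_per_rank, MPI_BYTE, 0, group_comm);
    if (grank == 0) free(chunk);
}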
22.2.7 Staged Opens
Staged opens bring two major advantages to read operations. They effectively
alleviate the "large metadata/footer" problem mentioned earlier. As
scientific simulations scale up and more analytical data, such as statistics and
indexes, are eventually included, file metadata (such as headers and
footers) will continue to expand regardless of the individual file format. Loading
the entire metadata section before any read/write operation is a common
practice in many parallel I/O libraries. Therefore, as the metadata grows and
can no longer fit into the memory of a compute core, more advanced metadata
management is needed. We can resolve this issue either by adopting a lazy