22.2.5 Index Generation
An aggregator in each group is responsible for generating the local indexes for
each subfile, and the aggregator in the first group (P0) also generates the indexes
in the metadata file. This is done by performing two rounds of collective
communication (using a sub-communicator) among all processes in order to
improve scalability. First, every rank builds its indexes locally and then sends its
portion of the indexes to its aggregator. Once the aggregator receives the data,
it populates the data structures for the process groups (PG), variables, and
attributes. These are used to create the footer for the local file. All aggregators
then send their indexes to the highest-level aggregator (P0), which writes out
a combined global metadata file for the entire cohort.
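The two-round aggregation can be pictured as a pair of gathers over MPI communicators. The sketch below is a simplified illustration under assumed conventions, not the actual ADIOS code: each rank's index is treated as a raw byte buffer, group_comm stands for one group's sub-communicator with rank 0 as that group's aggregator, and the second round over an aggregator-only communicator is only indicated in a comment.

/* Minimal sketch of the two-round index aggregation (hypothetical names;
 * the real implementation differs in detail). */
#include <mpi.h>
#include <stdlib.h>

void aggregate_indexes(MPI_Comm group_comm, const char *local_index, int local_len)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    /* Round 1: every rank sends its index size, then its index bytes,
     * to the group aggregator (rank 0 of the sub-communicator). */
    int *lens = NULL, *displs = NULL, total = 0;
    char *group_index = NULL;
    if (grank == 0) {
        lens   = malloc(gsize * sizeof(int));
        displs = malloc(gsize * sizeof(int));
    }
    MPI_Gather(&local_len, 1, MPI_INT, lens, 1, MPI_INT, 0, group_comm);
    if (grank == 0) {
        for (int i = 0; i < gsize; i++) { displs[i] = total; total += lens[i]; }
        group_index = malloc(total);
    }
    MPI_Gatherv((void *)local_index, local_len, MPI_BYTE,
                group_index, lens, displs, MPI_BYTE, 0, group_comm);

    /* The aggregator now populates the PG, variable, and attribute structures
     * from group_index and writes the footer of its subfile.  Round 2 repeats
     * the same gather pattern over an aggregator-only communicator, rooted at
     * P0, which then writes the combined global metadata file (omitted). */

    if (grank == 0) { free(lens); free(displs); free(group_index); }
}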
22.2.6 Staged Read Method
The subfiling technique can greatly improve write performance; however, it
poses new challenges for reading data. Without carefully managing metadata
operations such as MPI_File_open() or POSIX open(), subsetting data can
be very expensive, particularly for operations such as planar access. Suppose
there are M MPI processes and N subfiles on disk, and each MPI rank needs
to read in an arbitrary plane from a multi-dimensional array, which can be
scattered across all subfiles. There will be M * N file opens simultaneously
issued to the Lustre metadata server, which can be overwhelming. Figure 22.3
shows the metadata operation cost that measures the open() and close() time.
It is clear that as the run is scaled up, the metadata cost increases quickly.
Consider a typical S3D post-processing run: a 2,400-core post-processing job
reads in data across the subfiles dumped out by a 96,000-core computation job.
Depending on the access pattern, it may issue 4 million file opens. Assuming
each file open takes 0.5 ms, the metadata operations alone will take around
40 minutes. The staged read method tackles this issue by performing
staged file operations, in which only selected MPI processes open and close files.
I/O chunking is another technique used here to achieve bulk reads, similar to
MPI-IO, and details are discussed next.
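As a rough illustration of staged file operations combined with I/O chunking, the sketch below assumes one designated reader per group: only that rank opens the subfile and issues a single bulk read, and the data is then scattered to the other ranks over the sub-communicator. The function name, the fixed per-rank chunk size, and the data layout are hypothetical and are not part of the ADIOS read API.

/* Hypothetical staged-read helper: only rank 0 of group_comm opens the
 * subfile (so M*N opens shrink to one open per group), reads one large
 * chunk, and scatters equal-sized pieces to the other ranks. */
#include <mpi.h>
#include <stdlib.h>

void staged_read(MPI_Comm group_comm, const char *subfile,
                 MPI_Offset offset, int bytes_per_rank, char *my_buf)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    char *chunk = NULL;
    if (grank == 0) {
        /* Only the group's reader contacts the Lustre metadata server. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, subfile, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        chunk = malloc((size_t)gsize * bytes_per_rank);
        /* One bulk read instead of many small reads (I/O chunking). */
        MPI_File_read_at(fh, offset, chunk, gsize * bytes_per_rank,
                         MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }

    /* Hand each rank its portion of the bulk read. */
    MPI_Scatter(chunk, bytes_per_rank, MPI_BYTE,
                my_buf, bytes_per_rank, MPI_BYTE, 0, group_comm);
    if (grank == 0) free(chunk);
}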
22.2.7 Staged Opens
Staged opens bring two major advantages to read operations. They effectively
alleviate the "large metadata/footer" problem mentioned earlier. As
scientific simulations scale up and more analytical data, such as statistics and
indexes, are eventually included, file metadata (such as headers and
footers) will continue to expand regardless of the individual file format. Loading
the entire metadata section before any read/write operation is a common
practice in many parallel I/O libraries. Therefore, as the metadata grows and
can no longer fit into the memory of a compute core, more advanced metadata
management is needed. We can resolve this issue either by adopting a lazy