to be granted. Thus, the achievable performance is significantly limited by the
conflicting file locks.
To overcome such problems, MPI collective I/O functions can convey the
exact user intent, in this example treating the parallel write of the whole
global array as a single request. The program fragment using a collective write
function is shown in Figure 13.4(d). It appears almost identical to the indepen-
dent case, except for the name of the write function. MPI collective I/O requires
the participation of all processes that open the shared file. This requirement
gives a collective I/O implementation an opportunity to exchange access
information and reorganize I/O requests among the processes. Several
process-collaboration strategies have been proposed, such as two-phase I/O [2],
disk-directed I/O [4], and server-directed I/O [8].
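
As a concrete illustration (not the code of Figure 13.4, which is not reproduced here), the following sketch writes a block-partitioned 2D array collectively. The 4x4 global array of integers, the 2x2 per-process blocks, and the file name are assumptions made for this example only. Replacing MPI_File_write_all with the independent MPI_File_write would yield the independent-I/O version while leaving the rest of the program unchanged.

/* Collective write of a block-partitioned 2D array (illustrative sketch).
 * Assumed layout: 4x4 global array of ints, 2x2 block per process,
 * run with exactly 4 MPI processes. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int gsizes[2] = {4, 4};                 /* global array dimensions      */
    int lsizes[2] = {2, 2};                 /* local block held per process */
    int starts[2] = {(rank / 2) * 2,        /* block-block decomposition    */
                     (rank % 2) * 2};
    int local[4]  = {rank, rank, rank, rank};

    /* Describe this process's block within the global array. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all four processes must participate, which lets
     * the MPI-IO library aggregate the four requests into one. */
    MPI_File_write_all(fh, local, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}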
Two-phase I/O is a representative collaborative I/O technique that runs
in user space [11]. It exchanges data among processes so that the rearranged
requests can be serviced by the underlying file system with the best
performance. Two-phase I/O conceptually consists of a request aggrega-
tion phase (also referred to as the communication phase) and a file access phase
(or simply the I/O phase). In the request aggregation phase, a subset of MPI
processes is picked as I/O aggregators that act as I/O proxies for the rest
of the processes. The aggregate file access region requested by all processes
is divided among the aggregators into non-overlapping sections, called file
domains. For collective writes, the non-aggregator processes send their re-
quests to the aggregators based on their file domains. In the file access phase,
each aggregator commits the aggregated requests to the file system. ROMIO
adopts the two-phase I/O strategy for implementing the collective MPI-IO
functions.
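
The control flow of a collective write under this strategy can be sketched roughly as follows. This is a deliberately simplified illustration rather than ROMIO's implementation: it assumes every process contributes one contiguous block of chunk bytes at file offset rank*chunk, that ranks 0 through naggr-1 serve as the aggregators, that nprocs is divisible by naggr, and that the shared file has already been opened collectively.

/* Highly simplified two-phase collective write (illustrative sketch only). */
#include <mpi.h>
#include <stdlib.h>

void two_phase_write(MPI_Comm comm, MPI_File fh,
                     char *buf, int chunk, int naggr)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* File domains: each aggregator owns an equal share of the
     * aggregate access region [0, nprocs*chunk). */
    long total   = (long)nprocs * chunk;
    long fd_size = total / naggr;

    /* --- Phase 1: request aggregation (communication phase) --- */
    /* With this layout each process's block falls entirely inside one
     * file domain, so a single send per process suffices. */
    int target = (int)(((long)rank * chunk) / fd_size);
    MPI_Request sreq;
    MPI_Isend(buf, chunk, MPI_CHAR, target, 0, comm, &sreq);

    char *collected = NULL;
    if (rank < naggr) {
        int per_aggr = nprocs / naggr;      /* senders per aggregator */
        collected = malloc(fd_size);
        for (int i = 0; i < per_aggr; i++) {
            /* Place each incoming block at its offset within this
             * aggregator's file domain. */
            int src = rank * per_aggr + i;
            MPI_Recv(collected + (long)i * chunk, chunk, MPI_CHAR,
                     src, 0, comm, MPI_STATUS_IGNORE);
        }
    }
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    /* --- Phase 2: file access (I/O phase) --- */
    /* Each aggregator issues one large contiguous write for its domain. */
    if (rank < naggr)
        MPI_File_write_at(fh, (MPI_Offset)rank * fd_size,
                          collected, (int)fd_size, MPI_CHAR,
                          MPI_STATUS_IGNORE);
    free(collected);
}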
The bottom part of Figure 13.3 depicts the two-phase I/O operation for
the 2D array example. Assuming P0 and P2 are chosen as I/O aggregators, the
aggregate access region of the collective write operation is evenly divided into
two file domains, one for each aggregator. I/O data are redistributed from all
four processes to the two aggregators during the request aggregation phase.
Specifically, aggregator rank 0 receives data from both ranks 0 and 1,
while aggregator rank 2 receives data from ranks 1, 2, and 3. In the file access
phase, each aggregator combines the received data into a single, contiguous
request and then makes a write call to the file system.
Recently, several optimizations have been proposed that further improve the
performance of collective I/O. Various file domain partitioning methods have been
studied that are adaptively determined based on the file locking policies
of the underlying file systems, in order to minimize lock conflicts for collec-
tive I/O [5]. In Sehrish et al.'s work [9], a pipelined strategy was developed
to overlap the two phases of the two-phase I/O method. In this work, large
requests are divided into smaller ones, each of a size equal to the file stripe size,
and redistributed to the I/O aggregators using MPI asynchronous communi-
cation in a pipelined fashion, so that the asynchronous communication overlaps
with the file I/O.
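
The sketch below illustrates this pipelining idea from an aggregator's point of view; it is not the implementation evaluated in [9]. The double buffering, stripe-sized sub-requests, sender list, and tag scheme are hypothetical choices made for this illustration, and the non-aggregator side (which would issue matching MPI_Isend calls tagged by stripe index) is omitted.

/* Pipelined aggregation on one aggregator (illustrative sketch only).
 * Assumptions: this rank aggregates nstripes stripes of `stripe` bytes,
 * each stripe gathered in equal pieces from the nsenders ranks listed
 * in senders[]; stripe s is received with tag s. */
#include <mpi.h>
#include <stdlib.h>

void pipelined_aggregator(MPI_Comm comm, MPI_File fh, MPI_Offset fd_start,
                          const int *senders, int nsenders,
                          int stripe, int nstripes)
{
    int piece = stripe / nsenders;          /* bytes per sender per stripe */
    char *bufs[2] = { malloc(stripe), malloc(stripe) };
    MPI_Request *reqs = malloc(nsenders * sizeof(MPI_Request));

    /* Pre-post receives for the first stripe. */
    for (int i = 0; i < nsenders; i++)
        MPI_Irecv(bufs[0] + i * piece, piece, MPI_CHAR,
                  senders[i], 0, comm, &reqs[i]);

    for (int s = 0; s < nstripes; s++) {
        int cur = s % 2, nxt = (s + 1) % 2;

        /* Wait for the current stripe's data to arrive ... */
        MPI_Waitall(nsenders, reqs, MPI_STATUSES_IGNORE);

        /* ... post receives for the next stripe so that communication
         * for stripe s+1 proceeds while stripe s is being written ... */
        if (s + 1 < nstripes)
            for (int i = 0; i < nsenders; i++)
                MPI_Irecv(bufs[nxt] + i * piece, piece, MPI_CHAR,
                          senders[i], s + 1, comm, &reqs[i]);

        /* ... and write the current stripe while those receives progress. */
        MPI_File_write_at(fh, fd_start + (MPI_Offset)s * stripe,
                          bufs[cur], stripe, MPI_CHAR, MPI_STATUS_IGNORE);
    }

    free(bufs[0]); free(bufs[1]); free(reqs);
}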
 