to be granted. Thus, the achievable performance is significantly limited by the
conflicting file locks.
To overcome such problems, MPI collective I/O functions can convey the
exact user intent, in this example treating the parallel write of the whole
global array as a single request. The program fragment using a collective write
function is shown in Figure 13.4(d). It appears almost identical to the indepen-
dent case, except for the name of the write function. MPI collective I/O requires
the participation of all processes that open the shared file. This requirement
gives a collective I/O implementation an opportunity to exchange access
information and reorganize I/O requests among the processes. Several
process-collaboration strategies have been proposed, such as two-phase I/O [2],
disk-directed I/O [4], and server-directed I/O [8].
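
As a concrete illustration (not the code of Figure 13.4, which is not reproduced here), the following sketch writes a block-partitioned 2D array collectively. The 4x4 global array of integers, the 2x2 per-process blocks, and the file name are assumptions made for this example only. Replacing MPI_File_write_all with the independent MPI_File_write would yield the independent-I/O version while leaving the rest of the program unchanged.

/* Collective write of a block-partitioned 2D array (illustrative sketch).
 * Assumed layout: 4x4 global array of ints, 2x2 block per process,
 * run with exactly 4 MPI processes. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int gsizes[2] = {4, 4};                 /* global array dimensions      */
    int lsizes[2] = {2, 2};                 /* local block held per process */
    int starts[2] = {(rank / 2) * 2,        /* block-block decomposition    */
                     (rank % 2) * 2};
    int local[4]  = {rank, rank, rank, rank};

    /* Describe this process's block within the global array. */
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all four processes must participate, which lets
     * the MPI-IO library aggregate the four requests into one. */
    MPI_File_write_all(fh, local, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}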
Two-phase I/O is a representative collaborative I/O technique that runs
in user space [11]. It exchanges data among processes so that the rearranged
requests can be serviced by the underlying file system with the best
performance. Two-phase I/O conceptually consists of a request aggrega-
tion phase (also referred to as the communication phase) and a file access phase
(or simply the I/O phase). In the request aggregation phase, a subset of MPI
processes is picked as I/O aggregators that act as I/O proxies for the rest
of the processes. The aggregate file access region requested by all processes
is divided among the aggregators into non-overlapping sections, called file
domains. For collective writes, the non-aggregator processes send their re-
quests to the aggregators based on their file domains. In the file access phase,
each aggregator commits the aggregated requests to the file system. ROMIO
adopts the two-phase I/O strategy for implementing the collective MPI-IO
functions.
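
The control flow of a collective write under this strategy can be sketched roughly as follows. This is a deliberately simplified illustration rather than ROMIO's implementation: it assumes every process contributes one contiguous block of chunk bytes at file offset rank*chunk, that ranks 0 through naggr-1 serve as the aggregators, that nprocs is divisible by naggr, and that the shared file has already been opened collectively.

/* Highly simplified two-phase collective write (illustrative sketch only). */
#include <mpi.h>
#include <stdlib.h>

void two_phase_write(MPI_Comm comm, MPI_File fh,
                     char *buf, int chunk, int naggr)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* File domains: each aggregator owns an equal share of the
     * aggregate access region [0, nprocs*chunk). */
    long total   = (long)nprocs * chunk;
    long fd_size = total / naggr;

    /* --- Phase 1: request aggregation (communication phase) --- */
    /* With this layout each process's block falls entirely inside one
     * file domain, so a single send per process suffices. */
    int target = (int)(((long)rank * chunk) / fd_size);
    MPI_Request sreq;
    MPI_Isend(buf, chunk, MPI_CHAR, target, 0, comm, &sreq);

    char *collected = NULL;
    if (rank < naggr) {
        int per_aggr = nprocs / naggr;      /* senders per aggregator */
        collected = malloc(fd_size);
        for (int i = 0; i < per_aggr; i++) {
            /* Place each incoming block at its offset within this
             * aggregator's file domain. */
            int src = rank * per_aggr + i;
            MPI_Recv(collected + (long)i * chunk, chunk, MPI_CHAR,
                     src, 0, comm, MPI_STATUS_IGNORE);
        }
    }
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    /* --- Phase 2: file access (I/O phase) --- */
    /* Each aggregator issues one large contiguous write for its domain. */
    if (rank < naggr)
        MPI_File_write_at(fh, (MPI_Offset)rank * fd_size,
                          collected, (int)fd_size, MPI_CHAR,
                          MPI_STATUS_IGNORE);
    free(collected);
}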
The bottom part of Figure 13.3 depicts the two-phase I/O operation for
the 2D array example. Assuming P0 and P2 are chosen as I/O aggregators, the
aggregate access region of the collective write operation is evenly divided into
two file domains, one for each aggregator. I/O data are redistributed from all
four processes to the two aggregators during the request aggregation phase.
Specifically, aggregator rank 0 receives data from both ranks 0 and 1,
while aggregator rank 2 receives data from ranks 1, 2, and 3. In the file access
phase, each aggregator combines the received data into a single, contiguous
request and then makes a write call to the file system.
Recently, several optimizations have been proposed that further improve the
performance of collective I/O. Various file domain partitioning methods have been
studied that are adaptively determined based on the file locking policies
of the underlying file systems, in order to minimize lock conflicts for collec-
tive I/O [5]. In Sehrish et al.'s work [9], a pipelined strategy was developed
to overlap the two phases of the two-phase I/O method. In this work, large
requests are divided into smaller ones, each of a size equal to the file stripe size,
and redistributed to the I/O aggregators using MPI asynchronous communi-
cation in a pipelined fashion, so that the asynchronous communication overlaps
with the file I/O.
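
The sketch below illustrates this pipelining idea from an aggregator's point of view; it is not the implementation evaluated in [9]. The double buffering, stripe-sized sub-requests, sender list, and tag scheme are hypothetical choices made for this illustration, and the non-aggregator side (which would issue matching MPI_Isend calls tagged by stripe index) is omitted.

/* Pipelined aggregation on one aggregator (illustrative sketch only).
 * Assumptions: this rank aggregates nstripes stripes of `stripe` bytes,
 * each stripe gathered in equal pieces from the nsenders ranks listed
 * in senders[]; stripe s is received with tag s. */
#include <mpi.h>
#include <stdlib.h>

void pipelined_aggregator(MPI_Comm comm, MPI_File fh, MPI_Offset fd_start,
                          const int *senders, int nsenders,
                          int stripe, int nstripes)
{
    int piece = stripe / nsenders;          /* bytes per sender per stripe */
    char *bufs[2] = { malloc(stripe), malloc(stripe) };
    MPI_Request *reqs = malloc(nsenders * sizeof(MPI_Request));

    /* Pre-post receives for the first stripe. */
    for (int i = 0; i < nsenders; i++)
        MPI_Irecv(bufs[0] + i * piece, piece, MPI_CHAR,
                  senders[i], 0, comm, &reqs[i]);

    for (int s = 0; s < nstripes; s++) {
        int cur = s % 2, nxt = (s + 1) % 2;

        /* Wait for the current stripe's data to arrive ... */
        MPI_Waitall(nsenders, reqs, MPI_STATUSES_IGNORE);

        /* ... post receives for the next stripe so that communication
         * for stripe s+1 proceeds while stripe s is being written ... */
        if (s + 1 < nstripes)
            for (int i = 0; i < nsenders; i++)
                MPI_Irecv(bufs[nxt] + i * piece, piece, MPI_CHAR,
                          senders[i], s + 1, comm, &reqs[i]);

        /* ... and write the current stripe while those receives progress. */
        MPI_File_write_at(fh, fd_start + (MPI_Offset)s * stripe,
                          bufs[cur], stripe, MPI_CHAR, MPI_STATUS_IGNORE);
    }

    free(bufs[0]); free(bufs[1]); free(reqs);
}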
 