largest problem, and the network probably constrains only the top speed
(which no applications achieve in practice). In the case where an application
runs on a partition smaller than the entire system and cannot saturate the
bandwidth capability (as is possible with tens of client nodes on a TLCC2
system), there will be file system bandwidth left for each partition in the
system.
Present application designers are adopting an N-to-M strategy in which files
are shared over a subset of the N compute processes, resulting in a set of M
files smaller in number than the compute processes. This gives some latitude
to adapt to metadata performance constraints. Future applications will have
to move away from the POSIX model to achieve greater performance. For the
BG/L generation of system, the practical sweet spot was that for N processes,
M was the number of I/O nodes in the compute partition running N processes.
This seemed to hold for the BG/P generation. For the BG/Q generation, a
good starting point seems to be having M be the number of compute nodes
(about 1/16 of the total process count).
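As a concrete illustration of the N-to-M approach, the sketch below (with a hypothetical grouping factor and file-naming scheme, not taken from the source) uses MPI_Comm_split to divide the N ranks into groups that each share one output file, reducing the file count from N to M.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Choose M: here, one shared file per 16 ranks (assumed grouping factor). */
    int ranks_per_file = 16;
    int color = rank / ranks_per_file;   /* which shared file this rank uses */

    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &file_comm);

    /* M files total, named output.0000, output.0001, ... (hypothetical names). */
    char fname[64];
    snprintf(fname, sizeof(fname), "output.%04d", color);

    /* The ranks in file_comm would now open fname collectively (for example
     * with MPI_File_open) and write their portions into the shared file. */

    MPI_Comm_free(&file_comm);
    MPI_Finalize();
    return 0;
}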
5.7.2 Recommendations to Application Developers
At LLNL, LC is often called upon to advise application developers and
users of various application codes on how to interact efficiently with the Lustre
file systems. Before discussing a couple of specific efforts along these lines,
here is a high-level outline of the major points:
Minimize I/O. Although this may seem obvious, there is a temptation
to save all data that might be of use. Writing unneeded data hurts the
overall performance of the code.
Minimize opens and closes. Opens and closes are expensive operations
which hit the metadata server. It is best to open a file once during the
execution of a code, and to close it once.
Aggregate data. Lustre is designed around efficient transfers of larger
blocks of data, around 1 MB in the case of Sequoia and the Lustre file
system attached to it. Aggregating transfers of data into larger blocks
of contiguous data is a good practice. For example, a subset of tasks can
collect the data over the network and then write fewer, larger, aggregate
chunks rather than having each MPI task write its own smaller chunk
of data (see the sketch after this list).
Align data. Aligning I/O transfers on 1-MB boundaries is another good
practice.
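The following minimal sketch combines the aggregation and alignment advice, assuming a hypothetical per-rank payload size, grouping factor, and output file name. Each group of ranks gathers its data to one aggregator, which opens the file once, writes a single chunk at a 1 MB-aligned offset, and closes it, instead of every rank issuing its own small write.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define LUSTRE_ALIGN (1 << 20)       /* 1 MB alignment target */
#define PER_RANK_BYTES (64 * 1024)   /* assumed per-rank payload */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ranks_per_agg = 16;            /* assumed aggregation factor */
    int color = rank / ranks_per_agg;
    MPI_Comm agg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &agg_comm);

    int agg_rank;
    MPI_Comm_rank(agg_comm, &agg_rank);

    char *mine = malloc(PER_RANK_BYTES);
    memset(mine, rank & 0xff, PER_RANK_BYTES);   /* stand-in for real data */

    /* Round each group's chunk up to a 1 MB multiple so every aggregator's
     * file offset stays 1 MB-aligned. */
    size_t full  = (size_t)ranks_per_agg * PER_RANK_BYTES;
    size_t chunk = (full + LUSTRE_ALIGN - 1) / LUSTRE_ALIGN * LUSTRE_ALIGN;

    char *agg_buf = (agg_rank == 0) ? calloc(1, chunk) : NULL;
    MPI_Gather(mine, PER_RANK_BYTES, MPI_BYTE,
               agg_buf, PER_RANK_BYTES, MPI_BYTE, 0, agg_comm);

    if (agg_rank == 0) {
        /* Open once, write one large aligned chunk, close once. */
        int fd = open("checkpoint.dat", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) {
            off_t offset = (off_t)color * (off_t)chunk;  /* 1 MB-aligned */
            pwrite(fd, agg_buf, chunk, offset);
            close(fd);
        }
        free(agg_buf);
    }
    free(mine);
    MPI_Comm_free(&agg_comm);
    MPI_Finalize();
    return 0;
}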
Now consider two examples of application efforts at LLNL that use the
Lustre file systems more efficiently: SILO and scalable checkpoint/restart.
 