largest problem, and the network probably constrains only the top speed
(which no applications achieve in practice). In the case where an application
runs on a partition smaller than the entire system and cannot saturate the
bandwidth capability (as is possible with tens of client nodes on a TLCC2
system), there will be file system bandwidth left for each partition in the
system.
Present application designers are adopting an N-to-M strategy in which files
are shared over a subset of the N compute processes, resulting in a set of M
files smaller in number than the compute processes. This gives some latitude
to adapt to metadata performance constraints. Future applications will have
to move away from the POSIX model to achieve greater performance. For the
BG/L generation of system, the practical sweet spot was that for N processes,
M was the number of I/O nodes in the compute partition running N processes.
This seemed to hold for the BG/P generation. For the BG/Q generation, a
good starting point seems to be having M be the number of compute nodes
(about 1/16 of the total process count).
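As a concrete illustration of the N-to-M approach, the sketch below (with a hypothetical grouping factor and file-naming scheme, not taken from the source) uses MPI_Comm_split to divide the N ranks into groups that each share one output file, reducing the file count from N to M.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Choose M: here, one shared file per 16 ranks (assumed grouping factor). */
    int ranks_per_file = 16;
    int color = rank / ranks_per_file;   /* which shared file this rank uses */

    MPI_Comm file_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &file_comm);

    /* M files total, named output.0000, output.0001, ... (hypothetical names). */
    char fname[64];
    snprintf(fname, sizeof(fname), "output.%04d", color);

    /* The ranks in file_comm would now open fname collectively (for example
     * with MPI_File_open) and write their portions into the shared file. */

    MPI_Comm_free(&file_comm);
    MPI_Finalize();
    return 0;
}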
5.7.2 Recommendations to Application Developers
At LLNL, LC is often called upon to advise application developers and
users of various application codes on how to interact efficiently with the Lustre
file systems. Before discussing a couple of specific efforts along these lines,
here is a high-level outline of the major points:
Minimize I/O. Although this may seem obvious, there is a temptation
to save all data that might be of use. Writing unneeded data hurts the
overall performance of the code.
Minimize opens and closes. Opens and closes are expensive operations
which hit the metadata server. It is best to open a file once during the
execution of a code, and to close it once.
Aggregate data. Lustre is designed around efficient transfers of larger
blocks of data, around 1 MB in the case of Sequoia and the Lustre file
system attached to it. Aggregating transfers of data into larger blocks
of contiguous data is a good practice. For example, a subset of tasks can
collect the data over the network and then write fewer, larger, aggregate
chunks rather than having each MPI task write its own smaller chunk
of data (see the sketch after this list).
Align data. Aligning I/O transfers on 1-MB boundaries is another good
practice.
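The following minimal sketch combines the aggregation and alignment advice, assuming a hypothetical per-rank payload size, grouping factor, and output file name. Each group of ranks gathers its data to one aggregator, which opens the file once, writes a single chunk at a 1 MB-aligned offset, and closes it, instead of every rank issuing its own small write.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define LUSTRE_ALIGN (1 << 20)       /* 1 MB alignment target */
#define PER_RANK_BYTES (64 * 1024)   /* assumed per-rank payload */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ranks_per_agg = 16;            /* assumed aggregation factor */
    int color = rank / ranks_per_agg;
    MPI_Comm agg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &agg_comm);

    int agg_rank;
    MPI_Comm_rank(agg_comm, &agg_rank);

    char *mine = malloc(PER_RANK_BYTES);
    memset(mine, rank & 0xff, PER_RANK_BYTES);   /* stand-in for real data */

    /* Round each group's chunk up to a 1 MB multiple so every aggregator's
     * file offset stays 1 MB-aligned. */
    size_t full  = (size_t)ranks_per_agg * PER_RANK_BYTES;
    size_t chunk = (full + LUSTRE_ALIGN - 1) / LUSTRE_ALIGN * LUSTRE_ALIGN;

    char *agg_buf = (agg_rank == 0) ? calloc(1, chunk) : NULL;
    MPI_Gather(mine, PER_RANK_BYTES, MPI_BYTE,
               agg_buf, PER_RANK_BYTES, MPI_BYTE, 0, agg_comm);

    if (agg_rank == 0) {
        /* Open once, write one large aligned chunk, close once. */
        int fd = open("checkpoint.dat", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0) {
            off_t offset = (off_t)color * (off_t)chunk;  /* 1 MB-aligned */
            pwrite(fd, agg_buf, chunk, offset);
            close(fd);
        }
        free(agg_buf);
    }
    free(mine);
    MPI_Comm_free(&agg_comm);
    MPI_Finalize();
    return 0;
}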
Now consider two examples of application efforts at LLNL that use the
Lustre file systems more efficiently: SILO and scalable checkpoint/restart.
 