practice often the primary source of I/O performance loss. Root causes for
this imbalance include many-to-few I/O strategies, file system striping, and
congestion of I/O due to overlapping I/O operations. I/O performance losses,
balanced or not, are sometimes due to the transfer (buffer) sizes of I/Os being
so small that transactional overheads are high, or due to synchronous locking
I/O operations.
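
To make the transfer-size point concrete, the C sketch below writes the same 4 MiB of data twice, once as many 64-byte writes and once as a single buffered write, so the per-call overhead of the small transfers can be measured directly. The file names, sizes, and timing helper are illustrative assumptions, not taken from any of the profiled applications.

/* Sketch: the same 4 MiB written as many 64-byte transfers and as one
 * large transfer, to expose per-call (transactional) overhead. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    const size_t small = 64, count = 1 << 16;      /* 64 B x 65536 = 4 MiB */
    char *chunk  = malloc(small);
    char *buffer = malloc(small * count);
    memset(chunk,  'x', small);
    memset(buffer, 'x', small * count);

    /* Many tiny writes: every call pays the per-operation overhead. */
    int fd = open("tiny_writes.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    double t0 = now();
    for (size_t i = 0; i < count; i++)
        if (write(fd, chunk, small) < 0) perror("write");
    close(fd);
    printf("64 B writes:  %.3f s\n", now() - t0);

    /* The same bytes in one transfer: the overhead is amortized. */
    fd = open("one_write.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    t0 = now();
    if (write(fd, buffer, small * count) < 0) perror("write");
    close(fd);
    printf("single write: %.3f s\n", now() - t0);

    free(chunk);
    free(buffer);
    return 0;
}
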
When the I/O load is balanced across tasks, the next question is whether
the sustained rates are achieving the I/O rates that the storage system is
expected to deliver. If not, what is the underlying cause of the loss? In
some cases the loss is due to the I/O strategy itself, and in other cases it is
due to resource scheduling or contention that lies outside the application's
control. Defensive I/O strategies are therefore sought as much as absolutely
optimal ones. I/O hangs are a notorious source of vexation among HPC
enthusiasts, and the notion of defense extends to the lower end of performance
as well. Most HPC I/O goes unmonitored, and this is likely a rich area for
investigation to guide future data science architectures [5].
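
As one way to ask the balance question in code, the MPI sketch below assumes each rank has already timed its own I/O phase and then reduces those measurements to rank 0, which reports the spread and the aggregate rate for comparison against the rate the storage system is expected to deliver. The function name, the placeholder measurements, and the 2x-mean imbalance threshold are assumptions made for the example.

/* Sketch: each rank contributes the time and bytes of its own I/O phase
 * (measured elsewhere); rank 0 summarizes the spread across ranks. */
#include <mpi.h>
#include <stdio.h>

static void report_io_balance(double io_seconds, double io_bytes) {
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double tmin, tmax, tsum, btotal;
    MPI_Reduce(&io_seconds, &tmin,   1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&io_seconds, &tmax,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&io_seconds, &tsum,   1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&io_bytes,   &btotal, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double tmean = tsum / nranks;
        printf("I/O time per rank: min %.2f s  mean %.2f s  max %.2f s\n",
               tmin, tmean, tmax);
        printf("aggregate rate: %.1f MB/s (compare with the expected rate)\n",
               btotal / tmax / 1.0e6);
        if (tmax > 2.0 * tmean)
            printf("warning: slowest rank is more than 2x the mean; "
                   "the I/O load looks imbalanced\n");
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* Placeholder values; a real code would time its own I/O phase. */
    report_io_balance(1.0, 1.0e8);
    MPI_Finalize();
    return 0;
}
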
There is much to be gained from continued research in this area. As exascale
architectures emerge, the pathways from compute core to disk will become
more complex, as will their performance. It is interesting to consider
architectural simulation in the design and provisioning of such systems. Given
a body of existing I/O profiles, can one map these into an estimate of the
performance that would be possible on a proposed architecture? To what degree
can we construct useful models for the design of future I/O systems [3]?
To make actionable decisions about I/O, it is important to build models
from profiles that are tightly integrated with application performance as it
happens. The following sections draw from HPC application I/O scenarios
observed at NERSC using IPM.
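
As a minimal illustration of capturing an application's I/O activity as it happens, the sketch below interposes on the POSIX write() call through LD_PRELOAD and accumulates the time and bytes spent in it. This is only a sketch in the spirit of interposition-based profilers such as IPM, not IPM's actual implementation; the counter names and the report printed at exit are invented for the example.

/* Sketch: interpose on write() via LD_PRELOAD and accumulate the time
 * and bytes spent in it; a profile is printed when the program exits. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static double io_time = 0.0;     /* seconds accumulated inside write() */
static long long io_bytes = 0;   /* bytes written                      */

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Wrapper: forward to the real write() and record time and volume. */
ssize_t write(int fd, const void *buf, size_t count) {
    static ssize_t (*real_write)(int, const void *, size_t) = NULL;
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");
    double t0 = now();
    ssize_t ret = real_write(fd, buf, count);
    io_time += now() - t0;
    if (ret > 0)
        io_bytes += ret;
    return ret;
}

/* Report the accumulated profile when the application exits. */
__attribute__((destructor))
static void report(void) {
    fprintf(stderr, "write(): %lld bytes in %.3f s\n", io_bytes, io_time);
}

Compiled into a shared object (for example, gcc -shared -fPIC -o iotrace.so iotrace.c -ldl) and loaded with LD_PRELOAD=./iotrace.so, such a wrapper captures the application's write traffic without modifying the application itself, and the same idea extends to reads and to MPI-IO calls.
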
26.2 Success Stories
26.2.1 Chombo's ftruncate
Chombo's ftruncate is a simple case study that shows why profiling is
best done in a production setting. Figure 26.5 shows a wide range of I/O
tuning techniques applied by HPC experts to the Chombo code. The dominant
increase in I/O bandwidth is attributable to removing an extraneous POSIX
call from the production-deployed parallel I/O libraries. A profiling interface
that captures the application's I/O activity, the operating system's, or
preferably both is often enough to reveal which type of I/O and/or which
system resources drive the time spent in I/O. In some cases the improvements
listed above took place in how the middleware is used, and in other cases
changes were made directly to the middleware. For instance, the remove