4.4.1.2
Tuning
The initial configuration and tuning of these systems took significant
effort to achieve good performance. Once the basic hardware was validated
to meet its performance targets, the system was tested as a whole with
GPFS. One common complication was that the hardware components perform
best under differing conditions. For example, the Myrinet and InfiniBand
networks achieve maximum throughput more easily using smaller message
sizes, such as 64–256 KiB; however, the storage arrays deliver much better
performance using larger request sizes, such as 4096 or 8192 KiB. On the
Blue Gene systems specifically, the I/O forwarder must also transfer
messages in certain sizes, which needs tuning of its own: the larger the
maximum transfer size, the more memory the I/O forwarder process must
reserve for data transfer. On both systems the message size is based on
the request size to optimize the disk storage performance. Tuning the
component pieces as well as GPFS took around six months for each system.
One of the major difficulties is that GPFS (and PVFS) have many tunable
parameters for dialing in performance. These parameters often have
internal dependencies, so setting one may affect how another behaves.
Testing and adjusting these parameters is the most tedious task facing
the deployer of a parallel file system.
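Much of this dialing-in happens through GPFS's cluster-wide configuration interface. The sketch below shows the kind of parameters typically involved; the option names are real `mmchconfig` settings, but the values are illustrative assumptions, not the production settings described here.

```shell
# Illustrative GPFS tuning knobs (hedged sketch; values are
# examples, not the settings used on these systems).

# A larger pagepool gives GPFS more cache for data and metadata.
mmchconfig pagepool=4G

# maxMBpS is the per-node I/O rate GPFS assumes when scheduling
# prefetch and write-behind; it interacts with network throughput.
mmchconfig maxMBpS=2000

# maxblocksize must be raised before a file system can be created
# with a block size matching the storage arrays' preferred request
# size (e.g., 4 MiB or 8 MiB).
mmchconfig maxblocksize=8M

# The file system block size is then fixed at creation time (-B),
# trading network message size against storage request size:
#   mmcrfs fs0 -F disks.lst -B 8M
```

Because the parameters interact (for example, a larger block size raises memory demands in the pagepool and in the I/O forwarders), each change typically has to be re-benchmarked rather than applied in isolation.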
4.4.1.3
Reliability
GPFS has proven quite reliable in day-to-day operations. If a file
server or controller goes offline, the system continues to operate
normally; applications do not notice the failure beyond reduced
performance. However, the complexity of GPFS introduces its own set of
problems. GPFS has several manager functions (quorum, token, file system)
that are critical to operations. Originally on Intrepid, all 128 servers oper-
ated as both quorum nodes and manager nodes. This means that all nodes
were able to contribute to decision making and possibly assume a particular
management function. This caused unusual failure modes and, in consultation
with IBM, the number of quorum and manager nodes was reduced to 16, one for
each array. This resolved the issue of certain failover operations taking much
longer than expected.
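Quorum and manager roles are assigned per node with GPFS's `mmchnode` command. A hedged sketch of the kind of change involved, using hypothetical node names rather than Intrepid's actual topology:

```shell
# Illustrative role reassignment (hypothetical node names; the
# actual change designated 16 quorum/manager nodes, one per array).

# Demote most file servers from quorum and manager duty.
mmchnode --nonquorum --client -N fs003,fs004,fs005,fs006,fs007,fs008

# Keep one quorum and manager node per storage array.
mmchnode --quorum --manager -N fs001,fs002
```

With fewer voting members, quorum decisions and manager-function failover involve far fewer participants, which is consistent with the faster failover behavior described above.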
For an extended period of time, GPFS was ALCF's number one source of
job failures on Intrepid by a large margin. The primary mode of failure was
that a job would fail to boot because the I/O node could not join the GPFS
cluster. This problem boiled down to a lack of resources for token management
on the client cluster. The client cluster of the Blue Gene /P I/O node con-
sisted of only one manager node, the service node, to handle all management
functions. At the time, this node was quite powerful but it also had to run the
entire Blue Gene control system. The I/O nodes could not be manager nodes
because they were not persistent. The I/O nodes rebooted with every job.
ALCF deployed two additional servers, with 24 GB of RAM, which became
dedicated token managers in an active/passive redundancy configuration.