4.4.1.2
Tuning
The initial configuration and tuning of these systems took significant
effort to achieve good performance. Once the basic hardware was validated
to meet its performance targets, the system was tested as a whole with
GPFS. One common complication was that the hardware components perform
best under differing conditions. For example, the Myrinet and InfiniBand
networks achieve maximum throughput more easily using smaller message
sizes, such as 64–256 KiB; however, the storage arrays deliver much better
performance using larger request sizes, such as 4096 or 8192 KiB. On the
Blue Gene systems specifically, the I/O forwarder must also transfer
messages in certain sizes, which needs tuning of its own: the larger the
maximum transfer size, the more memory the I/O forwarder process must
reserve for data transfer. On both systems the message size is based on
the request size to optimize the disk storage performance. Tuning the
component pieces as well as GPFS took around six months for each system.
One of the major difficulties is that GPFS (and PVFS) have many tunable
parameters for dialing in performance. These parameters often have
internal dependencies, so setting one may affect how another behaves.
Testing and adjusting these parameters is the most tedious task facing
the deployer of a parallel file system.
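Much of this dialing-in happens through GPFS's cluster-wide configuration interface. The sketch below shows the kind of parameters typically involved; the option names are real `mmchconfig` settings, but the values are illustrative assumptions, not the production settings described here.

```shell
# Illustrative GPFS tuning knobs (hedged sketch; values are
# examples, not the settings used on these systems).

# A larger pagepool gives GPFS more cache for data and metadata.
mmchconfig pagepool=4G

# maxMBpS is the per-node I/O rate GPFS assumes when scheduling
# prefetch and write-behind; it interacts with network throughput.
mmchconfig maxMBpS=2000

# maxblocksize must be raised before a file system can be created
# with a block size matching the storage arrays' preferred request
# size (e.g., 4 MiB or 8 MiB).
mmchconfig maxblocksize=8M

# The file system block size is then fixed at creation time (-B),
# trading network message size against storage request size:
#   mmcrfs fs0 -F disks.lst -B 8M
```

Because the parameters interact (for example, a larger block size raises memory demands in the pagepool and in the I/O forwarders), each change typically has to be re-benchmarked rather than applied in isolation.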
4.4.1.3
Reliability
GPFS has proven quite reliable in day-to-day operations. If a file
server or controller goes offline, the system continues to operate
normally; applications do not notice the failure beyond reduced
performance. However, the complexity of GPFS introduces its own set of
problems. GPFS has several manager functions (quorum, token, file system)
that are critical to operations. Originally on Intrepid, all 128 servers oper-
ated as both quorum nodes and manager nodes. This means that all nodes
were able to contribute to decision making and possibly assume a particular
management function. This caused unusual failure modes and, in consultation
with IBM, the number of quorum and manager nodes was reduced to 16, one for
each array. This resolved the issue of certain failover operations taking much
longer than expected.
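Quorum and manager roles are assigned per node with GPFS's `mmchnode` command. A hedged sketch of the kind of change involved, using hypothetical node names rather than Intrepid's actual topology:

```shell
# Illustrative role reassignment (hypothetical node names; the
# actual change designated 16 quorum/manager nodes, one per array).

# Demote most file servers from quorum and manager duty.
mmchnode --nonquorum --client -N fs003,fs004,fs005,fs006,fs007,fs008

# Keep one quorum and manager node per storage array.
mmchnode --quorum --manager -N fs001,fs002
```

With fewer voting members, quorum decisions and manager-function failover involve far fewer participants, which is consistent with the faster failover behavior described above.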
For an extended period of time, GPFS was ALCF's number one source of
job failures on Intrepid by a large margin. The primary mode of failure was
that a job would fail to boot because the I/O node could not join the GPFS
cluster. This problem boiled down to a lack of resources for token management
on the client cluster. The client cluster of the Blue Gene /P I/O node con-
sisted of only one manager node, the service node, to handle all management
functions. At the time, this node was quite powerful but it also had to run the
entire Blue Gene control system. The I/O nodes could not be manager nodes
because they were not persistent. The I/O nodes rebooted with every job.
ALCF deployed two additional servers, with 24 GB of RAM, which became
dedicated token managers in an active/passive redundancy configuration.