- Clients that may autonomously choose which servers receive writes based on
  quality of service.
- Servers that may move and replicate data autonomously to improve fault
  tolerance.
- Servers that may come and go quickly, due to failure or temporary
  instantiation.
- Clients that are expected to cooperate in enforcing consistency.
In short, Sirocco provides a self-organizing system that allows local decision
making in clients and servers to yield a self-tuning, reliable file system.
Sirocco enables self-organization by exposing the characteristics of each
server's storage to the storage system. Heterogeneous media is quickly becoming
the norm in HPC systems, so tiered solutions (including the Burst Buffer) are
being proposed to manage it. Sirocco is implemented as a large victim cache,
with ejections targeted from less reliable storage to more reliable storage
automatically, without global knowledge of tiering. This is enabled by resource
discovery, where a storage node seeks out more reliable nodes as targets for
ejections. It is adaptive because the targets are not fixed and can be abandoned
in favor of targets that provide better service (including when the original
target ceases to operate).
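To make the ejection-target selection concrete, the following minimal Python
sketch shows how a storage node might pick a more reliable, currently reachable
server learned through resource discovery, re-evaluating the choice whenever a
better target appears or the current one disappears. The names ServerInfo and
pick_ejection_target are hypothetical illustrations, not Sirocco's interface.

```python
# Illustrative sketch only: ServerInfo and pick_ejection_target are assumed
# names, not part of Sirocco's actual API.
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ServerInfo:
    node_id: str
    reliability: float  # estimated durability of the server's media (higher is better)
    reachable: bool     # learned through resource discovery


def pick_ejection_target(local_reliability: float,
                         candidates: Iterable[ServerInfo]) -> Optional[ServerInfo]:
    """Pick a reachable server that is more reliable than local storage.

    The choice is recomputed for every ejection, so targets are never fixed:
    a newly discovered, better server wins, and a target that has ceased to
    operate simply drops out of the candidate set.
    """
    usable = [s for s in candidates
              if s.reachable and s.reliability > local_reliability]
    if not usable:
        return None  # no more-reliable tier known yet; keep the data locally
    return max(usable, key=lambda s: s.reliability)
```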
The implication is that Sirocco clients can automatically use faster targets
as burst buffers for checkpoints, or even use temporary servers running on
compute nodes as RAM-backed stores. These ephemeral RAM-backed nodes
can be started on demand, and relinquished when they finish bleeding checkpoints
to disk.
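The lifecycle of such an ephemeral server, which absorbs a checkpoint burst at
RAM speed and is released only after its contents have been bled to durable
storage, can be sketched as a toy example. The classes and methods below are
assumed names for illustration, not Sirocco's implementation.

```python
# Toy illustration of the ephemeral burst-buffer pattern; class and method
# names are assumptions, not Sirocco APIs.

class DurableStore:
    """Stand-in for a disk-backed, more reliable server."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value


class EphemeralRamStore:
    """Temporary server on a compute node, started on demand for a burst."""
    def __init__(self, backing: DurableStore):
        self.cache = {}
        self.backing = backing

    def put(self, key, value):
        # RAM-speed write; the client returns immediately.
        self.cache[key] = value

    def drain_and_shutdown(self):
        # Bleed everything to the more reliable tier, then relinquish the node.
        for key, value in self.cache.items():
            self.backing.put(key, value)
        self.cache.clear()


disk = DurableStore()
burst = EphemeralRamStore(disk)
burst.put("ckpt/rank0", b"checkpoint bytes")  # fast checkpoint write
burst.drain_and_shutdown()                    # later, the data reaches disk
```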
Sirocco is currently under active development, and is being used as a vehicle
to investigate several research areas like data location [32] and fault-aware
message passing [15]. Sirocco is an effort of the Advanced Storage Group
(ASG), which includes Argonne National Laboratory, Clemson University, the
University of Alabama at Birmingham, and Texas A&M University.
34.3.2 Guarding against Single-Node Failures and Soft Errors
In some circumstances, checkpoint I/O consumes a significant proportion
of an application's runtime. Oldfield et al. measured compute jobs on three
HPC platforms at 131,072 processes, and found that each spent about half
of its time checkpointing, with the optimal checkpoint frequency being about
every twenty minutes [26]. This is an artifact of applications experiencing total
failure upon a single node failure, whether permanent or transient. If a job
can continue after a single node failure, then checkpointing can be done less
frequently.
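This trade-off is not derived here with the model of [26], but it can be
illustrated with Young's classic first-order approximation for the optimal
compute interval between checkpoints, T = sqrt(2 * C * M), where C is the
checkpoint cost and M is the mean time between job-killing failures. The
numbers in the sketch below are hypothetical.

```python
# Illustration using Young's first-order approximation (not the model of [26]):
# the optimal interval between checkpoints grows with the mean time between
# job-killing failures, so tolerating single-node failures lets a job
# checkpoint less often. All numbers below are hypothetical.
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Optimal compute time between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


C = 5 * 60  # assume a 5-minute checkpoint write
for mtbf_hours in (4, 16, 64):  # increasing job-level MTBF
    t = young_interval(C, mtbf_hours * 3600)
    print(f"MTBF {mtbf_hours:3d} h -> checkpoint roughly every {t / 60:5.1f} min")
```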
 