The first item says that you need fault-tolerant software. Even with the
best of equipment, when you have a massive number of components, some will fail,
and the software has to be able to handle it. Whether failures occur once a week
or twice, the software must cope with them.
The second item points out that both hardware and software have to be highly
redundant. Doing so improves not only fault tolerance but also throughput. In the
case of Google, the PCs, disks, cables, and switches are all replicated many times
over. Furthermore, the index and the documents are broken into shards, the shards
are heavily replicated in each data center, and the data centers are themselves
replicated.
The third item is a consequence of the first two. If the system has been
properly designed to deal with failures, buying expensive components such as RAIDs
with SCSI disks is a mistake. Even they will fail, and spending 10 times as much to
cut the failure rate in half is a bad idea. It is better to buy 10 times as much
hardware and deal with the failures when they occur. At the very least, having more
hardware will give better performance when everything is working.
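The arithmetic behind this argument can be sketched as follows. Only the ratios
(10 times the price for half the failure rate) come from the text; the budget,
the unit cost, the failure probability, and the function name are hypothetical
values chosen for illustration, not figures from Google.

```python
def expected_working_units(budget, unit_cost, failure_prob):
    """Expected number of working units that a fixed budget buys.

    failure_prob is the assumed probability that a unit is down at a
    given moment (a hypothetical figure, used only for illustration).
    """
    units = budget // unit_cost          # how many units the budget covers
    return units * (1.0 - failure_prob)  # expected number still working


# Hypothetical numbers; only the 10x cost and 1/2 failure ratios are from the text.
BUDGET = 10_000
CHEAP_COST = 100
CHEAP_FAILURE = 0.04
PREMIUM_COST = 10 * CHEAP_COST        # 10 times as expensive
PREMIUM_FAILURE = CHEAP_FAILURE / 2   # half the failure rate

cheap = expected_working_units(BUDGET, CHEAP_COST, CHEAP_FAILURE)
premium = expected_working_units(BUDGET, PREMIUM_COST, PREMIUM_FAILURE)
print(f"commodity: {cheap:.1f} working units, premium: {premium:.1f}")
```

Under these assumed numbers the same budget yields nearly ten times as many
working commodity units, which is the point of the third item: once the software
tolerates failures, quantity beats per-unit reliability.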
For more information about Google, see Barroso et al. (2003) and Ghemawat
et al. (2003).
8.4.4 Communication Software for Multicomputers
Programming a multicomputer requires special software, usually libraries, for
handling interprocess communication and synchronization. In this section we will
say a few words about this software. For the most part, the same software
packages run on MPPs and clusters, so applications can be easily ported between
platforms.
Message-passing systems have two or more processes running independently
of one another. For example, one process may be producing some data and one or
more others may be consuming it. There is no guarantee that when the sender has
more data the receiver(s) will be ready for it, as each one runs its own program.
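A minimal sketch of this producer/consumer arrangement is given here. It is an
illustration under stated assumptions, not the API of any particular
message-passing library: Python threads stand in for the independent processes,
a `queue.Queue` stands in for the communication channel, and the names
`producer`, `consumer`, and `run_demo` are hypothetical. The unbounded queue
holds messages that arrive before the receiver is ready for them.

```python
import queue
import threading


def producer(channel, items):
    """Send each item, then a None sentinel marking the end of the stream."""
    for item in items:
        channel.put(item)   # plays the role of send
    channel.put(None)


def consumer(channel, received):
    """Receive messages until the sentinel arrives."""
    while True:
        item = channel.get()  # plays the role of receive; blocks until data arrives
        if item is None:
            break
        received.append(item)


def run_demo(items):
    """Run one producer and one consumer, each at its own pace."""
    channel = queue.Queue()  # unbounded: absorbs data the receiver is not ready for
    received = []
    threads = [
        threading.Thread(target=producer, args=(channel, items)),
        threading.Thread(target=consumer, args=(channel, received)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(run_demo(["a", "b", "c"]))
```

Neither side knows when the other will run; the channel is the only point of
coordination between them.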
Most message-passing systems provide two primitives (usually library calls),
send and receive, but several different kinds of semantics are possible. The three
main variants are:
1. Synchronous message passing.
2. Buffered message passing.
3. Nonblocking message passing.
In synchronous message passing, if the sender executes a send and the
receiver has not yet executed a receive, the sender is blocked (suspended) until the
receiver executes a receive, at which time the message is copied. When the sender
gets control back after the call, it knows that the message has been sent and
correctly received. This method has the simplest semantics and does not require any
 
 