As Section 6.10 describes, these economies of scale led to cloud computing, as the lower per-unit costs of a WSC meant that companies could rent them at a profit below what it costs outsiders to do it themselves. The flip side of 50,000 servers is failures. Figure 6.1 shows outages and anomalies for 2400 servers. Even if a server had a mean time to failure (MTTF) of an amazing 25 years (200,000 hours), the WSC architect would need to design for 5 server failures a day. Figure 6.1 lists the annualized disk failure rate as 2% to 10%. If there were 4 disks per server and their annual failure rate was 4%, with 50,000 servers the WSC architect should expect to see one disk fail per hour.
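The arithmetic behind these two figures is worth spelling out, since the same style of back-of-the-envelope estimate recurs throughout the chapter. The short Python sketch below uses only the constants quoted in the text (50,000 servers, a roughly 25-year MTTF, 4 disks per server, and a 4% annual disk failure rate); the exact outputs depend on how the 25-year MTTF is rounded.

# Back-of-the-envelope check of the failure rates quoted above.
# All constants are the ones given in the text.
SERVERS = 50_000
SERVER_MTTF_YEARS = 25               # quoted as 25 years (~200,000 hours)
DISKS_PER_SERVER = 4
DISK_ANNUAL_FAILURE_RATE = 0.04      # 4% per year
DAYS_PER_YEAR = 365
HOURS_PER_YEAR = DAYS_PER_YEAR * 24  # 8760

# Expected server failures per day: on average, each server fails once per MTTF.
server_failures_per_year = SERVERS / SERVER_MTTF_YEARS            # 2000
server_failures_per_day = server_failures_per_year / DAYS_PER_YEAR
print(f"Server failures per day: {server_failures_per_day:.1f}")  # ~5.5, the "5 a day" above

# Expected disk failures per hour across the whole WSC.
total_disks = SERVERS * DISKS_PER_SERVER                           # 200,000
disk_failures_per_hour = total_disks * DISK_ANNUAL_FAILURE_RATE / HOURS_PER_YEAR
print(f"Disk failures per hour: {disk_failures_per_hour:.2f}")     # ~0.9, about one per hour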
FIGURE 6.1 List of outages and anomalies with the approximate frequencies of occurrences in the first year of a new cluster of 2400 servers. We label what Google calls a cluster an array; see Figure 6.5. (Based on Barroso [2010].)
Example
Calculate the availability of a service running on the 2400 servers in Figure 6.1. Unlike a service in a real WSC, in this example the service cannot tolerate hardware or software failures. Assume that the time to reboot software is 5 minutes and the time to repair hardware is 1 hour.
Answer
We can estimate service availability by calculating the time of outages due to failures of each component. We'll conservatively take the lowest number in each category in Figure 6.1 and split the 1000 outages evenly between four components. We ignore slow disks (the fifth component of the 1000 outages) since they hurt performance but not availability, and power utility failures, since the uninterruptible power supply (UPS) system hides 99% of them.
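Figure 6.1 itself, with the per-category outage counts, is not reproduced above, so the sketch below only illustrates the shape of the calculation rather than the exact numbers: outages that need a hardware repair each cost 1 hour of downtime, outages fixed by a software reboot each cost 5 minutes, and the total is compared against the 8760 hours in a year. The split of the 1000 outages into four 250-event categories follows the text; which of those categories count as hardware versus software, as well as the 4 surviving power utility events and the 5000 individual server reboots, are illustrative assumptions here.

# Sketch of the availability estimate described above. The per-category
# counts are placeholders standing in for Figure 6.1, which is not
# reproduced here; only the structure of the calculation follows the text
# (1-hour hardware repairs, 5-minute software reboots).
hardware_outage_events = {            # repaired in 1 hour each (assumed split)
    "power utility (not masked by UPS)": 4,
    "hardware category 1": 250,
    "hardware category 2": 250,
    "hardware category 3": 250,
}
software_outage_events = {            # fixed by a 5-minute reboot (assumed split)
    "software crashes": 250,
    "individual server reboots": 5000,
}

HOURS_PER_YEAR = 365 * 24             # 8760

downtime_hours = (
    sum(hardware_outage_events.values()) * 1.0          # 1 hour per hardware repair
    + sum(software_outage_events.values()) * (5 / 60)   # 5 minutes per software reboot
)
availability = (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR
print(f"Downtime: {downtime_hours:.0f} hours/year, availability: {availability:.0%}")

With these placeholder counts the downtime comes to roughly 1200 hours per year, giving an availability in the mid-80% range, which is why a real WSC service must be designed to tolerate individual hardware and software failures rather than rely on the raw reliability of the machines.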
 