As Section 6.10 describes, these economies of scale led to cloud computing, as the lower per-unit costs of a WSC meant that companies could rent them at a profit below what it costs outsiders to do it themselves. The flip side of 50,000 servers is failures. Figure 6.1 shows outages and anomalies for 2400 servers. Even if a server had a mean time to failure (MTTF) of an amazing 25 years (200,000 hours), the WSC architect would need to design for 5 server failures a day. Figure 6.1 lists the annualized disk failure rate as 2% to 10%. If there were 4 disks per server and their annual failure rate was 4%, with 50,000 servers the WSC architect should expect to see one disk fail per hour.
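The arithmetic behind these two figures is worth spelling out, since the same style of back-of-the-envelope estimate recurs throughout the chapter. The short Python sketch below uses only the constants quoted in the text (50,000 servers, a roughly 25-year MTTF, 4 disks per server, and a 4% annual disk failure rate); the exact outputs depend on how the 25-year MTTF is rounded.

# Back-of-the-envelope check of the failure rates quoted above.
# All constants are the ones given in the text.
SERVERS = 50_000
SERVER_MTTF_YEARS = 25               # quoted as 25 years (~200,000 hours)
DISKS_PER_SERVER = 4
DISK_ANNUAL_FAILURE_RATE = 0.04      # 4% per year
DAYS_PER_YEAR = 365
HOURS_PER_YEAR = DAYS_PER_YEAR * 24  # 8760

# Expected server failures per day: on average, each server fails once per MTTF.
server_failures_per_year = SERVERS / SERVER_MTTF_YEARS            # 2000
server_failures_per_day = server_failures_per_year / DAYS_PER_YEAR
print(f"Server failures per day: {server_failures_per_day:.1f}")  # ~5.5, the "5 a day" above

# Expected disk failures per hour across the whole WSC.
total_disks = SERVERS * DISKS_PER_SERVER                           # 200,000
disk_failures_per_hour = total_disks * DISK_ANNUAL_FAILURE_RATE / HOURS_PER_YEAR
print(f"Disk failures per hour: {disk_failures_per_hour:.2f}")     # ~0.9, about one per hour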
FIGURE 6.1 List of outages and anomalies with the approximate frequencies of occurrences in the first year of a new cluster of 2400 servers. We label what Google calls a cluster an array; see Figure 6.5. (Based on Barroso [2010].)
Example
Calculate the availability of a service running on the 2400 servers in Figure 6.1. Unlike a service in a real WSC, in this example the service cannot tolerate hardware or software failures. Assume that the time to reboot software is 5 minutes and the time to repair hardware is 1 hour.
Answer
We can estimate service availability by calculating the time of outages due to failures of each component. We'll conservatively take the lowest number in each category in Figure 6.1 and split the 1000 outages evenly between four components. We ignore slow disks (the fifth component of the 1000 outages) since they hurt performance but not availability, and power utility failures, since the uninterruptible power supply (UPS) system hides 99% of them.
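Figure 6.1 itself, with the per-category outage counts, is not reproduced above, so the sketch below only illustrates the shape of the calculation rather than the exact numbers: outages that need a hardware repair each cost 1 hour of downtime, outages fixed by a software reboot each cost 5 minutes, and the total is compared against the 8760 hours in a year. The split of the 1000 outages into four 250-event categories follows the text; which of those categories count as hardware versus software, as well as the 4 surviving power utility events and the 5000 individual server reboots, are illustrative assumptions here.

# Sketch of the availability estimate described above. The per-category
# counts are placeholders standing in for Figure 6.1, which is not
# reproduced here; only the structure of the calculation follows the text
# (1-hour hardware repairs, 5-minute software reboots).
hardware_outage_events = {            # repaired in 1 hour each (assumed split)
    "power utility (not masked by UPS)": 4,
    "hardware category 1": 250,
    "hardware category 2": 250,
    "hardware category 3": 250,
}
software_outage_events = {            # fixed by a 5-minute reboot (assumed split)
    "software crashes": 250,
    "individual server reboots": 5000,
}

HOURS_PER_YEAR = 365 * 24             # 8760

downtime_hours = (
    sum(hardware_outage_events.values()) * 1.0          # 1 hour per hardware repair
    + sum(software_outage_events.values()) * (5 / 60)   # 5 minutes per software reboot
)
availability = (HOURS_PER_YEAR - downtime_hours) / HOURS_PER_YEAR
print(f"Downtime: {downtime_hours:.0f} hours/year, availability: {availability:.0%}")

With these placeholder counts the downtime comes to roughly 1200 hours per year, giving an availability in the mid-80% range, which is why a real WSC service must be designed to tolerate individual hardware and software failures rather than rely on the raw reliability of the machines.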
 