To amortize the cost of repair, failed machines are addressed in batches by repair techni-
cians. When the diagnosis software is confident in its assessment, the part is immediately re-
placed without going through the manual diagnosis process. For example, if the diagnostic
says disk 3 of a storage node is bad, the disk is replaced immediately. Failed machines with no
diagnostic or with low-confidence diagnostics are examined manually.
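The triage rule described above can be sketched in a few lines. This is only an illustration: the text does not say how confidence is scored or where the cutoff lies, so the `Diagnosis` type, the 0-to-1 confidence scale, and `CONFIDENCE_THRESHOLD` are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff: the text does not state how confident the
# diagnosis software must be before a part is swapped automatically.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class Diagnosis:
    part: str          # e.g. "disk 3"
    confidence: float  # assumed scale from 0.0 to 1.0


def triage(diagnosis: Optional[Diagnosis]) -> str:
    """Decide how a failed machine is handled."""
    if diagnosis is None:
        return "manual"  # no diagnostic available
    if diagnosis.confidence >= CONFIDENCE_THRESHOLD:
        return f"replace {diagnosis.part}"  # skip manual diagnosis
    return "manual"  # low-confidence diagnostic


print(triage(Diagnosis("disk 3", 0.97)))  # confident: replace the part
print(triage(Diagnosis("DIMM 1", 0.40)))  # low confidence: manual queue
print(triage(None))                       # no diagnostic: manual queue
```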
The goal is to have less than 1% of all nodes in the manual repair queue at any one time.
The average time in the repair queue is a week, even though it takes a repair technician
much less time to fix a machine. The longer latency suggests the importance of repair throughput, which
affects cost of operations. Note that the automated repairs of the first step take anywhere
from minutes for a reboot/reinstall to hours for running directed stress tests to make sure
the machine is indeed operational.
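The 1% target and the one-week average wait can be related through Little's Law (queue length = arrival rate × time in queue). The text does not invoke this law; it is used here only as a rough consistency check, and the fleet size of 50,000 nodes is a hypothetical figure chosen for illustration.

```python
def max_manual_repair_rate(fleet_size: int,
                           queue_fraction: float,
                           avg_weeks_in_queue: float) -> float:
    """Little's Law: L = lambda * W, so lambda = L / W.

    Returns the largest weekly arrival rate into the manual repair
    queue that keeps the queue at or below the target fraction.
    """
    max_queue_length = fleet_size * queue_fraction
    return max_queue_length / avg_weeks_in_queue


# Hypothetical fleet of 50,000 nodes, 1% queue target, one-week wait.
rate = max_manual_repair_rate(50_000, 0.01, 1.0)
print(f"{rate:.0f} machines per week")
```

Under these assumptions, the target holds only if no more than about 500 machines per week enter the manual queue. Halving the time in the queue would double the arrival rate that the same target can absorb.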
These latencies do not take into account the time to idle the broken servers. The reason is
that a big variable is the amount of state in the node. A stateless node takes much less time
than a storage node whose data may need to be evacuated before it can be replaced.
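A back-of-the-envelope sketch shows why the amount of state dominates the time to idle a node. The 12 TB of stored data and the 1 Gbit/s (125 MB/s) drain rate are hypothetical values, not figures from the text, and the model ignores replication and competing traffic.

```python
def evacuation_hours(stored_bytes: float, drain_bytes_per_sec: float) -> float:
    """Hours needed to copy a node's data elsewhere before servicing it."""
    if drain_bytes_per_sec <= 0:
        raise ValueError("drain rate must be positive")
    return stored_bytes / drain_bytes_per_sec / 3600


# Hypothetical storage node: 12 TB drained at 1 Gbit/s (125 MB/s).
storage_node = evacuation_hours(12e12, 125e6)
# A stateless node has nothing to move.
stateless_node = evacuation_hours(0, 125e6)

print(f"storage node: {storage_node:.1f} h, stateless node: {stateless_node:.1f} h")
```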
Summary
As of 2007, Google had already demonstrated several innovations to improve the energy
efficiency of its WSCs to deliver a PUE of 1.23 in Google A:
■ In addition to providing an inexpensive shell to enclose servers, the modified shipping con-
tainers separate hot and cold air plenums, which helps reduce the variation in intake air
temperature for servers. With less severe worst-case hot spots, cold air can be delivered at
warmer temperatures.
■ These containers also shrink the distance of the air circulation loop, which reduces energy
to move air.
■ Operating servers at higher temperatures means that air only has to be chilled to 81°F
(27°C) instead of the traditional 64°F to 71°F (18°C to 22°C).
■ A higher target cold air temperature helps put the facility more often within the range that
can be sustained by evaporative cooling solutions (cooling towers), which are more energy
efficient than traditional chillers.
■ Deploying WSCs in temperate climates to allow use of evaporative cooling exclusively for
portions of the year.
■ Deploying extensive monitoring hardware and software to measure actual PUE versus de-
signed PUE improves operational efficiency.
■ Operating more servers than the worst-case scenario for the power distribution system
would suggest, since it's statistically unlikely that thousands of servers would all be highly
busy simultaneously, while relying on the monitoring system to off-load work in the unlikely
case that they did [ Fan, Weber, and Barroso 2007 ] [ Ranganathan et al. 2006 ]. PUE improves
because the facility is operating closer to its fully designed capacity, where it is at its most
efficient because the servers and cooling systems are not energy proportional. Such
increased utilization reduces demand for new servers and new WSCs.
■ Designing motherboards that only need a single 12-volt supply so that the UPS function
could be supplied by standard batteries associated with each server instead of a battery
room, thereby lowering costs and reducing one source of inefficiency of power distribution
within a WSC.
■ Carefully designing the server board itself to improve its energy efficiency. For example,
underclocking the front-side bus on these microprocessors reduces energy usage with neg-