Clusters are often smallish affairs, ranging from a dozen to perhaps 500 PCs.
However, it is also possible to build very large ones from off-the-shelf PCs.
Google has done this in an interesting way that we will now look at.
Google
Google is a popular search engine for finding information on the Internet.
While its popularity is due, in part, to its simple interface and fast response time,
its design is anything but simple. From Google's point of view, the problem is that
it has to find, index, and store the entire World Wide Web (an estimated 40 billion
pages), be able to search the whole thing in under 0.5 sec, and handle tens of thou-
sands of queries/sec coming from all over the world 24 hours a day. In addition, it
must never go down, not even in the face of earthquakes, electrical power failures,
telecom outages, hardware failures and software bugs. And, of course, it has to do
all of this as cheaply as possible. Building a Google clone is definitely not an exer-
cise for the reader.
How does Google do it? To start with, Google operates multiple data centers
around the world. Not only does this approach provide backups in case one of
them is swallowed by an earthquake, but when www.google.com is looked up, the
sender's IP address is inspected and the address of the nearest data center is sup-
plied. The browser then sends the query there.
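As a rough illustration of this proximity-based routing, the sketch below picks the data center closest to a client's estimated location. The data-center names, coordinates, IP addresses, and the use of a great-circle distance are illustrative assumptions only; Google's actual DNS-based mechanism is not public.

```python
# Hypothetical sketch: return the front-end IP of the data center closest to a
# client, given coordinates obtained from some IP-geolocation service (assumed).
from math import radians, sin, cos, asin, sqrt

# Assumed example data centers: name -> (latitude, longitude, front-end IP).
DATA_CENTERS = {
    "us-west":   (45.6, -121.2, "203.0.113.10"),
    "eu-west":   (53.3,   -6.3, "203.0.113.20"),
    "asia-east": (24.1,  120.7, "203.0.113.30"),
}

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points on Earth."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_data_center(client_lat, client_lon):
    """Pick the data center whose coordinates are closest to the client."""
    return min(
        DATA_CENTERS.items(),
        key=lambda dc: great_circle_km(client_lat, client_lon, dc[1][0], dc[1][1]),
    )

# Example: a client in Paris (48.9 N, 2.3 E) would be directed to eu-west.
name, (lat, lon, ip) = nearest_data_center(48.9, 2.3)
print(name, ip)
```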
Each data center has at least one OC-48 (2.488-Gbps) fiber-optic connection
to the Internet, on which it receives queries and sends answers, as well as an
OC-12 (622-Mbps) backup connection from a different telecom provider, in case
the primary ones go down. Uninterruptible power supplies and emergency diesel
generators are available at all data centers to keep the show going during power
failures. Consequently, during a major natural disaster, performance will suffer,
but Google will keep running.
To get a better understanding of why Google chose the architecture it did, it is
useful to briefly describe how a query is processed once it hits its designated data
center. After arriving at the data center (step 1 in Fig. 8-43), the load balancer
routes the query to one of the many query handlers (2), and to the spelling checker
(3) and ad server (4) in parallel. Then the search words are looked up on the index
servers (5) in parallel. These servers contain an entry for each word on the Web.
Each entry lists all the documents (Web pages, PDF files, PowerPoint pres-
entations, etc.) containing the word, sorted in page-rank order. Page rank is deter-
mined by a complicated (and secret) formula, but the number of links to a page and
their own ranks play a large role.
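The published PageRank algorithm, on which Google's original ranking was based, captures exactly this idea: a page's rank depends on how many pages link to it and on the ranks of those linking pages. The sketch below runs that iteration on a tiny made-up link graph; the damping factor of 0.85 is the conventionally cited value, and none of this reflects Google's secret production formula.

```python
# A minimal sketch of the originally published PageRank iteration; the graph,
# damping factor, and iteration count are illustrative assumptions only.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with a uniform rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:           # a page passes rank to its link targets
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny hypothetical Web: many links into page "a", including from the highly
# ranked page "b", give "a" the highest rank.
graph = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a", "b"]}
for page, r in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(r, 3))
```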
To get higher performance, the index is divided into pieces called shards that
can be searched in parallel. Conceptually, at least, shard 1 contains all the words in
the index, with each one followed by the IDs of the n highest-ranked documents
containing that word. Shard 2 contains all the words and the IDs of the n next-
 