highest-ranked documents, and so on. As the Web grows, each of these shards
may later be split with the first k words in one set of shards, the next k words in a
second set of shards and so forth, in order to achieve even more search parallelism.
[Figure: a load balancer, query handler, spell checker, ad server, index servers, and document servers, connected by numbered steps 1-11. Each index server holds a shard of the index that maps words to document identifiers, e.g., aardvark 154, 3016, ...; abacus 973, 67231, ...]
Figure 8-43. Processing of a Google query.
The index servers return a set of document identifiers (6) that are then combined
according to the Boolean properties of the query. For example, if the search
was for +digital +capybara +dance, then only document identifiers appearing in all
three sets are used in the next step. In this step (7), the documents themselves are
referenced to extract their titles, URLs, and snippets of text surrounding the search
terms. The document servers contain many copies of the entire Web at each data
center, hundreds of terabytes at present. The documents are also divided into
shards to enhance parallel search. While processing a query does not require reading
the whole Web (or even reading the tens of terabytes on the index servers), having
to process 100 MB per query is normal.
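The combining step for a query such as +digital +capybara +dance can be sketched as a set intersection over the identifier lists returned by the index shards. This is only a minimal illustration, not Google's implementation: the function name `combine_required_terms`, the shard layout (one dictionary per shard), and the document identifiers are all assumptions made for the example.

```python
def combine_required_terms(shards, terms):
    """Return sorted document IDs that contain every required term.

    shards: list of dicts mapping word -> list of document IDs, one dict
            per index shard (a hypothetical layout chosen for illustration).
    terms:  the required (+) words of the query.
    """
    result = None
    for term in terms:
        # Gather the identifiers for this term from every shard.
        ids = set()
        for shard in shards:
            ids.update(shard.get(term, []))
        # Keep only the documents seen for all terms so far.
        result = ids if result is None else result & ids
    return sorted(result) if result else []


# Made-up identifiers, for illustration only.
shards = [
    {"digital": [154, 3016, 973], "capybara": [3016, 973]},
    {"digital": [8393], "dance": [973, 3016, 8393], "capybara": [8393]},
]
hits = combine_required_terms(shards, ["digital", "capybara", "dance"])
```

Document 154 appears only under "digital", so it is dropped. Only identifiers present in all three sets survive to step (7).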
When the results are returned to the query handler (8), the pages found are collated
into page-rank order. If potential spelling errors are detected (9), they are
announced and relevant ads are added (10). Displaying ads for advertisers interested
in buying specific search terms (e.g., "hotel" or "camcorder") is how
Google makes its money. Finally, the results are formatted in HTML (HyperText
Markup Language) and sent to the user as a Web page.
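The collation and formatting steps can be sketched as a sort followed by HTML rendering. The sketch below is an assumption-laden illustration rather than the real pipeline: the function name `collate_and_format`, the record fields `rank`, `url`, `title`, and `snippet`, and the sample data are hypothetical. It omits spelling suggestions and ads.

```python
from html import escape


def collate_and_format(results):
    """Sort results by descending rank score and render a simple HTML list.

    results: list of dicts with hypothetical keys "rank", "url", "title",
             and "snippet".
    """
    # Highest-ranked pages come first.
    ordered = sorted(results, key=lambda r: r["rank"], reverse=True)
    items = [
        '<li><a href="{}">{}</a><p>{}</p></li>'.format(
            escape(r["url"], quote=True),  # escape quotes inside the attribute
            escape(r["title"]),
            escape(r["snippet"]),
        )
        for r in ordered
    ]
    return "<ol>\n" + "\n".join(items) + "\n</ol>"


# Made-up results, for illustration only.
page = collate_and_format([
    {"rank": 0.2, "url": "http://example.com/low", "title": "Low",
     "snippet": "a capybara dance"},
    {"rank": 0.9, "url": "http://example.com/high", "title": "High & mighty",
     "snippet": "digital capybara dance"},
])
```

Escaping the title, URL, and snippet keeps text taken from arbitrary Web pages from being interpreted as markup in the result page.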