Information Technology Reference
In-Depth Information
Forexample,imagineacorpusofthetextextractedfromeverybookintheU.S.Library
ofCongress.Thiscannotfitinonecomputer,soinsteadtheinformationisspreadoverhun-
dredsorthousandsofleafmachines.Inadditiontotheleafmachinesaretheparentsandthe
root.Asearchquerywouldgotoarootserver,whichinturnrelaysthequerytoallparents.
Each parent repeats the query to all leaf nodes below it. Once the leaves have replied, the
parent ranks and sorts the results by relevancy.
Forexample,aleafmayreplythatallthewordsofthequeryexistinthesameparagraph
in one topic,but for another topic only some of the words exist(lessrelevant),ortheyexist
but not in the same paragraph or page (even less relevant). If the query is for the best 50
answers, the parent can send the top 50 results to the root and drop the rest. The root then
receives results from each parent and selects the best 50 of those to construct the reply.
This scheme also permits developers to work within a latency budget. If fast answers
are more important than perfect answers, parents and roots do not have to wait for slow
replies if the latency deadline is near.
Many variations of this pattern are possible. Redundant servers may exist with a load-
balancingschemetodividetheworkamongthemandroutearoundfailedservers.Expand-
ing the number of leaf servers can give each leaf a smaller portion of the corpus to search,
or each shard of corpus can be placed on multiple leaf servers to improve availability. Ex-
panding the number of parents at each level increases the capacity to sort and rank res-
ults.Theremaybeadditionallevelsofparentservers,makingthetreetaller.Theadditional
levels permit awiderfanout,whichisimportant foranextremely largecorpus.Theparents
may provide a caching function to relieve pressure on the leaf servers; in this case more
levelsofparentsmayimprovecacheeffectiveness.Thesetechniquescanalsohelpmitigate
congestion problems related to fan-in, as discussed in the previous section.
1.4 Distributed State
Large systems often store or process large amounts of state. State consists of data, such as
a database, that is frequently updated. Contrast this with a corpus, which is relatively static
orisupdatedonlyperiodicallywhenaneweditionispublished.Forexample,asystemthat
searches the U.S. Library of Congress may receive a new corpus each week. By compar-
ison, an email system is in constant churn with new data arriving constantly, current data
being updated (email messages being marked as “read” or moved between folders), and
data being deleted.
Distributed computing systems have many ways to deal with state. However, they all
involve some kind of replication and sharding, which brings about problems of consisten-
cy, availability, and partitioning.
Search WWH ::




Custom Search