The Origins and Future of Distributed Computing and Clouds - The Practice of Cloud System Administration


Case Study: MapReduce

MapReduce is a system for processing large batches of data in parallel. Suppose you needed to process 100 terabytes of data. Reading that much data from start to finish on a single machine would take a long time. Instead, dozens (or thousands) of machines could be set up to each process a portion of the data in parallel. The data they read would be processed (mapped) to create an intermediate result. These intermediates would be processed and then summarized (reduced) to produce the final result. MapReduce did all the difficult work of dividing up the data and delegating work to various machines. As a software developer using MapReduce, you just needed to write the “map” function and the “reduce” function. Everything else was taken care of.
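To make the division of labor concrete, here is a minimal sketch of a word count, assuming a toy single-process stand-in for the framework. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and are not the API of MapReduce or Hadoop; only the first two correspond to what a developer would write.

```python
from collections import defaultdict


def map_fn(chunk):
    """Map step: emit a (word, 1) pair for each word in one chunk of input."""
    for word in chunk.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce step: summarize all intermediate values collected for one key."""
    return key, sum(values)


def run_mapreduce(chunks, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework (runs in one process).

    A real system would hand each chunk to a different machine; here the
    chunks are simply processed one after another.
    """
    intermediate = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            intermediate[key].append(value)  # group intermediates by key
    return dict(reduce_fn(key, values) for key, values in intermediate.items())


counts = run_mapreduce(["the cat sat", "the dog sat down"], map_fn, reduce_fn)
print(counts)
```

Everything inside `run_mapreduce` is the part the developer does not write: splitting the input, grouping the intermediate results, and collecting the output.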

Because MapReduce was a centralized service, it could be improved and all users would benefit from the changes. For example, there was a good probability that a machine would fail during a MapReduce run that took days and involved thousands of machines. The logic for detecting a failed machine and sending its portion of data to another working machine was incorporated in the MapReduce system. The developer did not need to worry about it. Suppose a better way to detect and replace a failed machine was developed. This improvement would be made to the central MapReduce system and all users would benefit.
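The reassignment logic described above might look roughly like the following sketch. It assumes a simulated cluster in which a failed machine raises an exception; `WorkerFailed`, `make_worker`, and `run_with_reassignment` are illustrative names, not part of any real MapReduce implementation, and real systems detect failures through mechanisms such as timeouts rather than exceptions.

```python
class WorkerFailed(Exception):
    """Raised when a (simulated) machine dies while processing its chunk."""


def make_worker(name, broken=False):
    """Build a simulated machine; a broken one fails on every chunk."""
    def worker(chunk):
        if broken:
            raise WorkerFailed(name)
        return sum(chunk)  # stand-in for the real map work
    worker.name = name
    return worker


def run_with_reassignment(chunks, workers):
    """Process every chunk, moving work off any machine that fails."""
    healthy = list(workers)
    results, failures = [], []
    for i, chunk in enumerate(chunks):
        while True:
            if not healthy:
                raise RuntimeError("no healthy workers left")
            worker = healthy[i % len(healthy)]  # simple round-robin choice
            try:
                results.append(worker(chunk))
                break
            except WorkerFailed:
                healthy.remove(worker)          # stop using the failed machine
                failures.append((i, worker.name))
    return results, failures


workers = [make_worker("a"), make_worker("b", broken=True), make_worker("c")]
results, failures = run_with_reassignment([[1, 2], [3, 4], [5, 6]], workers)
print(results, failures)
```

Because this logic lives in the framework rather than in each job, replacing the round-robin choice or the failure test with something smarter would change nothing in the developer's map and reduce functions.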

One such improvement involved predicting which machines would fail. If one machine was taking considerably longer to process its share of the data, that slow performance could be a sign that it had a hard disk that was starting to fail or that some other problem was present. MapReduce could direct another machine to process the same portion of data. Whichever machine finished first would “win” and its results would be kept. The other machine's processing would be aborted. This kind of preemptive redundancy could save hours of processing, especially since a MapReduce run is only as fast as the slowest machine. Many such optimizations were developed over the years.
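A rough sketch of this "first to finish wins" idea follows, using threads to stand in for machines. The function names, the `patience` threshold, and the cooperative cancel flag are all assumptions made for illustration; the source does not describe how MapReduce decided a machine was too slow or how it aborted the loser.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def process(chunk, machine, delay, cancelled):
    """Simulated machine: 'works' for `delay` seconds unless told to abort."""
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        if cancelled.is_set():
            return None  # aborted because the other copy already finished
        time.sleep(0.01)
    return machine, sum(chunk)


def run_with_backup(chunk, primary_delay, backup_delay, patience):
    """Start a backup copy if the primary has not finished within `patience`."""
    cancelled = threading.Event()
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(process, chunk, "primary", primary_delay, cancelled)
        done, _ = wait([primary], timeout=patience)
        if done:
            return primary.result()  # primary was fast enough; no backup needed
        backup = pool.submit(process, chunk, "backup", backup_delay, cancelled)
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        cancelled.set()              # tell the slower copy to stop
        return next(iter(done)).result()


winner = run_with_backup([1, 2, 3], primary_delay=1.0, backup_delay=0.05, patience=0.1)
print(winner)
```

The same chunk is processed twice, so the technique trades some extra machine time for a shorter overall run; that trade is attractive precisely because the whole job waits on its slowest piece.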

Open source versions of MapReduce soon appeared. Now it was easy for other companies to adopt MapReduce's techniques. The most popular implementation is Hadoop. It includes a data storage system similar to GFS called HDFS (the Hadoop Distributed File System).

Scaling

If you've spent money to invent ways to create highly available distributed computing systems that scale well, the best way to improve your return on investment (ROI) is to use that technology in as many datacenters as possible.

Once you are maintaining hundreds of racks of similarly configured computers, it makes sense to manage them centrally. Centralized operations have economic benefits. For example, standardization makes it possible to automate operational tasks and therefore re-
