Information Technology Reference
In-Depth Information
Case Study: MapReduce
MapReduce is a system for processing large batches of data in parallel. Suppose you
needed to process 100 terabytes of data. Reading that much data from start to finish would
take a long time. Instead, dozens (or thousands) of machines could be set up to process a
portionofthedataeachinparallel.Thedatatheyreadwouldbeprocessed(mapped)tocre-
ate an intermediate result. These intermediates would be processed and then summarized
(reduced) to gain the final result. MapReduce did all the difficult work of dividing up the
data and delegating work to various machines. As a software developer using MapReduce,
youjustneededtowritethe“map”functionandthe“reduce”function.Everythingelsewas
taken care of.
BecauseMapReducewasacentralizedservice,itcouldbeimprovedandalluserswould
benefit from the changes. For example, there was a good probability that a machine would
failduringaMapReducerunthattakesdaysandinvolvesthousandsofmachines.Thelogic
for detecting a failed machine and sending its portion of data to another working machine
is incorporated in the MapReduce system. The developer does not need to worry about it.
Suppose a better way to detect and replace a failed machine is developed. This improve-
ment would be made to the central MapReduce system and all users would benefit.
One such improvement involved predicting which machines would fail. If one machine
wastakingconsiderablylongertoprocessitsshareofthedata,thatslowperformancecould
be a sign that it had a hard disk that was starting to fail or that some other problem was
present. MapReduce could direct another machine to process the same portion of data.
Whichevermachinefinishedfirstwould“win”anditsresultswouldbekept.Theotherma-
chine'sprocessingwouldbeaborted.Thiskindofpreemptiveredundancycouldsavehours
of processing, especially since a MapReduce run is only as fast as the slowest machine.
Many such optimizations were developed over the years.
Open source versions of MapReduce soon appeared. Now it was easy for other com-
panies to adopt MapReduce's techniques. The most popular implementation is Hadoop. It
includes a data storage system similar to GFS called HBase.
Scaling
If you've spent money to invent ways to create highly available distributed computing sys-
temsthatscalewell,thebestwaytoimproveyourreturnoninvestment (ROI)istousethat
technology on as many datacenters as possible.
Onceyouaremaintaininghundredsofracksofsimilarlyconfiguredcomputers,itmakes
sense to manage them centrally. Centralized operations have economic benefits. For ex-
ample, standardization makes it possible to automate operational tasks and therefore re-
Search WWH ::




Custom Search