thousands of nodes), individual workers often fail for some reason. If that
causes the entire MapReduce operation to fail, it might cause you to lose
a lot of work and have to restart the operation. Because the Mapper and
Reducer functions are free of side effects, they are idempotent—that
is, you can rerun them on the same input and get the same results.
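The idea can be sketched in a few lines. Below is a minimal, illustrative word-count Mapper and Reducer (not from the text) written as pure Python functions: they read only their inputs and return values, so rerunning either on the same data always produces the same output.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Sum all the counts emitted for a single word."""
    return (word, sum(counts))

line = "to be or not to be"
pairs = mapper(line)
# Rerunning the Mapper is harmless: same input, same output.
assert mapper(line) == pairs

# Group values by key (the Shuffle step), then reduce each group.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)
result = dict(reducer(w, c) for w, c in groups.items())
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Because neither function writes to shared state, running them once, twice, or on two workers at the same time makes no difference to the final result.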
When a worker crashes, or is merely running slowly, the MapReduce
Controller assigns a new worker to operate over the same data. If the initial
worker also completes, that is fine, because the operation is
idempotent. The Controller also needs to distinguish between a worker that
fails because of a hardware problem or a network hiccup and a worker
that fails because something in the Mapper or Reducer functions
causes them to crash on certain input data.
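A hypothetical sketch of this speculative re-execution (assumed structure, not from the text): the Controller simply runs the same idempotent task on more than one "worker," and because every attempt over the same data split returns the same answer, it can accept whichever result arrives first and discard the duplicates.

```python
def run_task(task, data):
    """Execute an idempotent task function over its assigned data split."""
    return task(data)

def controller(task, data, max_attempts=2):
    """Launch duplicate attempts of a task; any completed copy is usable."""
    results = []
    for attempt in range(max_attempts):
        # In a real framework each attempt would go to a different worker.
        results.append(run_task(task, data))
    # Idempotence guarantees all completed attempts agree.
    assert all(r == results[0] for r in results)
    return results[0]

total = controller(lambda split: sum(split), [1, 2, 3, 4])
# total == 10 no matter how many workers ran the task
```

A real Controller would also track how many times a given input split has crashed its workers, since repeated failures on the same data point to a bug in the user's functions rather than bad hardware.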
Comparative Analysis
The primary advantages of MapReduce are scalability and flexibility. Unlike
a relational database, where making things faster means changing
how the data is stored or buying faster hardware (which is often orders of
magnitude more expensive), you can scale a MapReduce cluster by just buying
more of the same commodity hardware. That is another way of saying that
MapReduce scales out linearly.
MapReduce is extremely flexible. Because the Mapper and Reducer
functions can be anything you want them to be, you can perform arbitrary
computations over your data. You're not locked into a language such as SQL
where you can perform only certain aggregations.
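As an illustration of that flexibility (a hedged example of mine, not from the text), a Mapper can run arbitrary code—here, a regular expression over a hypothetical log format—to extract structure that a fixed set of SQL aggregations could not easily reach.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<endpoint> <latency>ms".
LOG_PATTERN = re.compile(r"(\w+) (\d+)ms")

def mapper(line):
    """Extract an (endpoint, latency) pair from a free-form log line."""
    m = LOG_PATTERN.search(line)
    return [(m.group(1), int(m.group(2)))] if m else []

def reducer(endpoint, latencies):
    """Report the worst (maximum) latency observed per endpoint."""
    return (endpoint, max(latencies))

logs = ["search 120ms", "search 45ms", "login 30ms", "malformed entry"]
groups = defaultdict(list)
for line in logs:
    for endpoint, latency in mapper(line):
        groups[endpoint].append(latency)
worst = dict(reducer(e, ls) for e, ls in groups.items())
# worst == {"search": 120, "login": 30}
```

Any parsing, filtering, or computation expressible in the host language can go inside the Mapper and Reducer; malformed records are simply skipped rather than failing the whole query.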
There are downsides, however. MapReduce is designed for batch workloads,
not interactive ones. MapReduce frameworks usually take a long
time to spin up the requisite number of workers, and the Shuffle operation
often adds long delays. Most MapReduce operations take minutes or hours
to run, rather than the seconds you'd hope for if you were performing
exploratory analysis on your data.
Another drawback of MapReduce is that it forces you to divide up your
operation into a Map and Reduce phase, which is not usually the way you'd
think about your data analysis task. What's more, many computations can't
be expressed in a single MapReduce, so they may require multiple passes
through the data. It can be tricky to keep track of workflow for jobs that do
more than one MapReduce.
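To make the multi-pass point concrete, here is a small sketch (an assumed example, not from the text) in which the output of one MapReduce job becomes the input of the next: the first pass counts words, and a second pass is needed to find the most frequent one.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """A tiny in-memory MapReduce: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(k, vs) for k, vs in groups.items()]

# Pass 1: word count.
counts = run_mapreduce(
    ["to be or not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda w, ones: (w, sum(ones)),
)

# Pass 2: route every (word, count) pair to a single key and keep the max.
top = run_mapreduce(
    counts,
    mapper=lambda pair: [("top", pair)],
    reducer=lambda _, pairs: max(pairs, key=lambda p: p[1]),
)[0]
# top holds a most frequent word with its count
```

Real workflow tools exist precisely to track chains like this, where each job's output files feed the next job's Mappers.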