thousands of nodes), individual workers often fail for some reason. If that
causes the entire MapReduce operation to fail, it might cause you to lose
a lot of work and have to restart the operation. Because the Mapper and
Reducer functions are free of side effects, they are idempotent—that
is, you can rerun them on the same input and get the same results.
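The idea can be sketched in a few lines. Below is a minimal, illustrative word-count Mapper and Reducer (not from the text) written as pure Python functions: they read only their inputs and return values, so rerunning either on the same data always produces the same output.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    """Sum all the counts emitted for a single word."""
    return (word, sum(counts))

line = "to be or not to be"
pairs = mapper(line)
# Rerunning the Mapper is harmless: same input, same output.
assert mapper(line) == pairs

# Group values by key (the Shuffle step), then reduce each group.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)
result = dict(reducer(w, c) for w, c in groups.items())
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Because neither function writes to shared state, running them once, twice, or on two workers at the same time makes no difference to the final result.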
When a worker crashes, or is merely running slowly, the MapReduce
Controller assigns a new worker to operate over the same data. If the initial
worker also completes, that is fine, because the operation is
idempotent. The Controller also needs to distinguish between a worker that
fails because of a hardware problem or a network hiccup and a worker
that fails because something in the Mapper or Reducer functions
causes them to crash on certain input data.
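A hypothetical sketch of this speculative re-execution (assumed structure, not from the text): the Controller simply runs the same idempotent task on more than one "worker," and because every attempt over the same data split returns the same answer, it can accept whichever result arrives first and discard the duplicates.

```python
def run_task(task, data):
    """Execute an idempotent task function over its assigned data split."""
    return task(data)

def controller(task, data, max_attempts=2):
    """Launch duplicate attempts of a task; any completed copy is usable."""
    results = []
    for attempt in range(max_attempts):
        # In a real framework each attempt would go to a different worker.
        results.append(run_task(task, data))
    # Idempotence guarantees all completed attempts agree.
    assert all(r == results[0] for r in results)
    return results[0]

total = controller(lambda split: sum(split), [1, 2, 3, 4])
# total == 10 no matter how many workers ran the task
```

A real Controller would also track how many times a given input split has crashed its workers, since repeated failures on the same data point to a bug in the user's functions rather than bad hardware.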
Comparative Analysis
The primary advantages of MapReduce are scalability and flexibility. Unlike
a relational database, where making things faster means changing
how the data is stored or buying faster hardware (which is often orders of
magnitude more expensive), you can scale a MapReduce cluster by just buying
more of the same commodity hardware. That is another way of saying that
MapReduce scales out linearly.
MapReduce is extremely flexible. Because the Mapper and Reducer
functions can be anything you want them to be, you can perform arbitrary
computations over your data. You're not locked into a language such as SQL
where you can perform only certain aggregations.
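As an illustration of that flexibility (a hedged example of mine, not from the text), a Mapper can run arbitrary code—here, a regular expression over a hypothetical log format—to extract structure that a fixed set of SQL aggregations could not easily reach.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<endpoint> <latency>ms".
LOG_PATTERN = re.compile(r"(\w+) (\d+)ms")

def mapper(line):
    """Extract an (endpoint, latency) pair from a free-form log line."""
    m = LOG_PATTERN.search(line)
    return [(m.group(1), int(m.group(2)))] if m else []

def reducer(endpoint, latencies):
    """Report the worst (maximum) latency observed per endpoint."""
    return (endpoint, max(latencies))

logs = ["search 120ms", "search 45ms", "login 30ms", "malformed entry"]
groups = defaultdict(list)
for line in logs:
    for endpoint, latency in mapper(line):
        groups[endpoint].append(latency)
worst = dict(reducer(e, ls) for e, ls in groups.items())
# worst == {"search": 120, "login": 30}
```

Any parsing, filtering, or computation expressible in the host language can go inside the Mapper and Reducer; malformed records are simply skipped rather than failing the whole query.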
There are downsides, however. MapReduce is designed for batch workloads,
not interactive ones. MapReduce frameworks usually take a long
time to spin up the requisite number of workers, and the Shuffle operation
often adds long delays. Most MapReduce operations take minutes or hours
to run, rather than the seconds you'd hope for if you were performing
exploratory analysis on your data.
Another drawback of MapReduce is that it forces you to divide up your
operation into a Map and Reduce phase, which is not usually the way you'd
think about your data analysis task. What's more, many computations can't
be expressed in a single MapReduce, so they may require multiple passes
through the data. It can be tricky to keep track of workflow for jobs that do
more than one MapReduce.
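To make the multi-pass point concrete, here is a small sketch (an assumed example, not from the text) in which the output of one MapReduce job becomes the input of the next: the first pass counts words, and a second pass is needed to find the most frequent one.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """A tiny in-memory MapReduce: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(k, vs) for k, vs in groups.items()]

# Pass 1: word count.
counts = run_mapreduce(
    ["to be or not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda w, ones: (w, sum(ones)),
)

# Pass 2: route every (word, count) pair to a single key and keep the max.
top = run_mapreduce(
    counts,
    mapper=lambda pair: [("top", pair)],
    reducer=lambda _, pairs: max(pairs, key=lambda p: p[1]),
)[0]
# top holds a most frequent word with its count
```

Real workflow tools exist precisely to track chains like this, where each job's output files feed the next job's Mappers.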