Table 6.7 Configuration properties for enabling and disabling speculative execution

Property                                    Description
mapred.map.tasks.speculative.execution      Boolean property denoting whether speculative
                                            execution is enabled for map tasks
mapred.reduce.tasks.speculative.execution   Boolean property denoting whether speculative
                                            execution is enabled for reduce tasks
In general you should leave speculative execution on. The primary reason to turn it off is if your map or reduce tasks have side effects and are therefore not idempotent. For example, if a task writes to external files, speculative execution can cause multiple attempts of the same task to collide in trying to create the same external files. Turning off speculative execution ensures that only one attempt of each task runs at a time.
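To make this concrete, here is a minimal sketch of a driver that disables both properties before submitting a job. The class name and the elided job setup are hypothetical; the two setBoolean calls use the standard Configuration API with the property names from table 6.7:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SideEffectJobDriver {      // hypothetical driver class
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SideEffectJobDriver.class);
        // Disable speculative execution for both phases so that only one
        // attempt of each task runs at a time, avoiding collisions on
        // external files.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // ... set input/output paths, mapper, and reducer here ...
        JobClient.runJob(conf);
      }
    }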
NOTE If your tasks have side effects, you should also think through how Hadoop's recovery mechanism interacts with those side effects. For example, if a task writes to an external file, it's possible that the task dies right after writing it. Hadoop will then restart the task, which will try to write to that external file again. You need to make sure your tasks remain correct in such situations.
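One common way to keep such external writes correct across restarts is to write to an attempt-specific temporary file and rename it into place only on success. This is essentially the write-then-rename pattern Hadoop's own output committer uses for task output. A minimal sketch, with hypothetical paths and an attempt ID supplied by the caller:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicExternalWrite {      // hypothetical helper class
      public static void write(Configuration conf, byte[] data,
                               String attemptId) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path tmp = new Path("/external/tmp/" + attemptId);  // hypothetical path
        Path dst = new Path("/external/output/result");     // hypothetical path
        FSDataOutputStream out = fs.create(tmp, true);  // overwrite leftovers
        try {                                           // from a dead attempt
          out.write(data);
        } finally {
          out.close();
        }
        // The rename is the commit step. If an earlier attempt already
        // committed, the rename fails harmlessly because the output
        // (assumed deterministic) is already in place.
        fs.rename(tmp, dst);
      }
    }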
6.3.6 Refactoring code and rewriting algorithms
If you're willing to rewrite your MapReduce programs to optimize performance, both straightforward techniques and nontrivial, application-dependent rewrites can speed things up.
One straightforward technique for a Streaming program is to rewrite it in Java. Streaming is great for quickly creating a MapReduce job for ad hoc data analysis, but it doesn't run as fast as Java under Hadoop. Streaming jobs that start out as one-off queries but end up running frequently can gain from a Java reimplementation.
If you have several jobs that run on the same input data, there are probably opportunities to rewrite them as fewer jobs. For example, if you're computing both the maximum and the minimum of a data set, you can write a single MapReduce job that computes both rather than computing them separately in two different jobs. This may sound obvious, but in practice many jobs are originally written to do one thing well, and that's good design practice: a job's conciseness makes it widely applicable to different data sets for different purposes. Only after some usage should you start looking for groups of jobs that you can rewrite into faster combined ones.
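As a minimal sketch of such a combined job, the reducer below computes both statistics in one pass. It assumes a hypothetical mapper (not shown) that parses each record and emits its numeric value twice, once under the key "min" and once under the key "max", using the old org.apache.hadoop.mapred API:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Keeps the smallest value seen under key "min" and the largest
    // value seen under key "max", emitting one result per key.
    public class MinMaxReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        boolean wantMin = key.toString().equals("min");
        long best = wantMin ? Long.MAX_VALUE : Long.MIN_VALUE;
        while (values.hasNext()) {
          long v = values.next().get();
          if (wantMin ? v < best : v > best) {
            best = v;
          }
        }
        output.collect(key, new LongWritable(best));
      }
    }

Because min and max are both associative and commutative, the same class can also be registered as the combiner (conf.setCombinerClass(MinMaxReducer.class)), so each map task sends only one candidate value per key across the network.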
One of the most important things you can do to speed up a MapReduce program is to think hard about the underlying algorithm and ask whether a more efficient algorithm can compute the same results. This is true of any programming, but it is especially significant for MapReduce programs. Standard textbooks on algorithms and data