Table 6.7 Configuration properties for enabling and disabling speculative execution

Property                                    Description
mapred.map.tasks.speculative.execution      Boolean property denoting whether speculative
                                            execution is enabled for map tasks
mapred.reduce.tasks.speculative.execution   Boolean property denoting whether speculative
                                            execution is enabled for reduce tasks
In general you should leave speculative execution on. The primary reason to turn it off is if your map or reduce tasks have side effects and are therefore not idempotent. For example, if a task writes to external files, speculative execution can cause multiple attempts of the same task to collide in trying to create the same external files. Turning off speculative execution ensures that only one attempt of each task runs at a time.
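To make this concrete, here is a minimal sketch of a driver that disables both properties before submitting a job. The class name and the elided job setup are hypothetical; the two setBoolean calls use the standard Configuration API with the property names from table 6.7:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SideEffectJobDriver {      // hypothetical driver class
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SideEffectJobDriver.class);
        // Disable speculative execution for both phases so that only one
        // attempt of each task runs at a time, avoiding collisions on
        // external files.
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // ... set input/output paths, mapper, and reducer here ...
        JobClient.runJob(conf);
      }
    }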
NOTE If your tasks have side effects, you should also think through how Hadoop's recovery mechanism interacts with those side effects. For example, if a task writes to an external file, it's possible that the task dies right after writing it. Hadoop will then restart the task, which will try to write to that external file again. You need to make sure your tasks remain correct in such situations.
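One common way to keep such external writes correct across restarts is to write to an attempt-specific temporary file and rename it into place only on success. This is essentially the write-then-rename pattern Hadoop's own output committer uses for task output. A minimal sketch, with hypothetical paths and an attempt ID supplied by the caller:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicExternalWrite {      // hypothetical helper class
      public static void write(Configuration conf, byte[] data,
                               String attemptId) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path tmp = new Path("/external/tmp/" + attemptId);  // hypothetical path
        Path dst = new Path("/external/output/result");     // hypothetical path
        FSDataOutputStream out = fs.create(tmp, true);  // overwrite leftovers
        try {                                           // from a dead attempt
          out.write(data);
        } finally {
          out.close();
        }
        // The rename is the commit step. If an earlier attempt already
        // committed, the rename fails harmlessly because the output
        // (assumed deterministic) is already in place.
        fs.rename(tmp, dst);
      }
    }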
6.3.6 Refactoring code and rewriting algorithms
If you're willing to rewrite your MapReduce programs to optimize performance, both straightforward techniques and nontrivial, application-dependent rewrites can speed things up.
One straightforward technique for a Streaming program is to rewrite it in Java. Streaming is great for quickly creating a MapReduce job for ad hoc data analysis, but it doesn't run as fast as Java under Hadoop. Streaming jobs that start out as one-off queries but end up running frequently can gain from a Java reimplementation.
If you have several jobs that run on the same input data, there are probably opportunities to rewrite them as fewer jobs. For example, if you're computing both the maximum and the minimum of a data set, you can write a single MapReduce job that computes both rather than computing them separately in two different jobs. This may sound obvious, but in practice many jobs are originally written to do one thing well, and that's good design practice: a job's conciseness makes it widely applicable to different data sets for different purposes. Only after some usage should you start looking for groups of jobs that you can rewrite into faster combined ones.
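As a minimal sketch of such a combined job, the reducer below computes both statistics in one pass. It assumes a hypothetical mapper (not shown) that parses each record and emits its numeric value twice, once under the key "min" and once under the key "max", using the old org.apache.hadoop.mapred API:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Keeps the smallest value seen under key "min" and the largest
    // value seen under key "max", emitting one result per key.
    public class MinMaxReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        boolean wantMin = key.toString().equals("min");
        long best = wantMin ? Long.MAX_VALUE : Long.MIN_VALUE;
        while (values.hasNext()) {
          long v = values.next().get();
          if (wantMin ? v < best : v > best) {
            best = v;
          }
        }
        output.collect(key, new LongWritable(best));
      }
    }

Because min and max are both associative and commutative, the same class can also be registered as the combiner (conf.setCombinerClass(MinMaxReducer.class)), so each map task sends only one candidate value per key across the network.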
One of the most important things you can do to speed up a MapReduce program is to think hard about the underlying algorithm and ask whether a more efficient algorithm can compute the same results. This is true of any programming, but it is especially significant for MapReduce programs. Standard textbooks on algorithms and data