Streaming environment variables
Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces nonalphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapreduce.job.id property from within a Python Streaming script:
os.environ["mapreduce_job_id"]
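To put this in context, here is a minimal sketch of a Streaming mapper that tags each input line with the job ID read from the environment. The sketch is illustrative rather than taken from the text, and the "unknown" fallback is an assumption so the script can also be run outside Hadoop for testing:

#!/usr/bin/env python
# Minimal Streaming mapper sketch: under Hadoop Streaming, job
# configuration properties are visible as environment variables
# (with dots replaced by underscores).
import os
import sys

# Fallback value "unknown" is an assumption for local testing.
job_id = os.environ.get("mapreduce_job_id", "unknown")

for line in sys.stdin:
    # Emit each input line keyed by the job ID (illustrative only).
    sys.stdout.write("%s\t%s" % (job_id, line))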
You can also set environment variables for the Streaming processes launched by MapReduce by supplying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
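Inside the launched mapper or reducer process, the variable then appears in the ordinary process environment. A minimal sketch of reading it from Python (using os.environ.get so the script does not fail when the variable is unset, e.g. during local testing):

import os

# Returns "abracadabra" when launched with the -cmdenv option above;
# returns None if the variable is not set.
magic = os.environ.get("MAGIC_PARAMETER")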
Speculative Execution
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make the
overall job execution time smaller than it would be if the tasks ran sequentially. This
makes the job execution time sensitive to slow-running tasks, as it takes only one slow
task to make the whole job take significantly longer than it would have done otherwise.
When a job consists of hundreds or thousands of tasks, the possibility of a few straggling
tasks is very real.
Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect because the tasks still complete successfully, albeit after a longer time than expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. This is termed speculative execution of tasks.
It's important to understand that speculative execution does not work by launching two duplicate tasks at about the same time so they can race each other, which would be wasteful of cluster resources. Rather, the scheduler tracks the progress of all tasks of the same type (map and reduce) in a job, and launches speculative duplicates only for the small proportion that are running significantly slower than the average. When a task completes successfully, any duplicate tasks that are still running are killed, since they are no longer needed. So, if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.
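The following toy sketch illustrates the idea of speculating only on stragglers. It is not Hadoop's actual speculator; the function name, the notion of a "progress rate" per task, and the slowness_factor threshold are all assumptions made for illustration:

# Toy illustration of the speculation heuristic: duplicate only those
# tasks whose progress rate falls well below the average of their peers.
def pick_speculation_candidates(progress_rates, slowness_factor=0.5):
    """progress_rates: dict mapping task ID to progress per second
    (a hypothetical measure). Returns task IDs worth duplicating."""
    if not progress_rates:
        return []
    average = sum(progress_rates.values()) / len(progress_rates)
    threshold = average * slowness_factor
    return [task for task, rate in progress_rates.items() if rate < threshold]

# Example: t3 is a straggler and would get a speculative duplicate.
print(pick_speculation_candidates({"t1": 1.0, "t2": 0.9, "t3": 0.2}))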
Speculative execution is an optimization, not a feature to make jobs run more reliably. If there are bugs that sometimes cause a task to hang or slow down, relying on speculative execution to avoid these problems is unwise, since the same bugs are likely to affect the speculative task too.
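Speculative execution can be turned off on a per-job basis. For example, passing the following generic options to the Streaming launcher (before the Streaming-specific options) disables it for map and reduce tasks respectively; these are the Hadoop 2 property names, and both default to true:

-D mapreduce.map.speculative=false
-D mapreduce.reduce.speculative=false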