Streaming environment variables
Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces nonalphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapreduce.job.id property from within a Python Streaming script:
os.environ["mapreduce_job_id"]
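To put this in context, here is a minimal sketch of a Streaming mapper that tags each input line with the job ID read from the environment. The sketch is illustrative rather than taken from the text, and the "unknown" fallback is an assumption so the script can also be run outside Hadoop for testing:

#!/usr/bin/env python
# Minimal Streaming mapper sketch: under Hadoop Streaming, job
# configuration properties are visible as environment variables
# (with dots replaced by underscores).
import os
import sys

# Fallback value "unknown" is an assumption for local testing.
job_id = os.environ.get("mapreduce_job_id", "unknown")

for line in sys.stdin:
    # Emit each input line keyed by the job ID (illustrative only).
    sys.stdout.write("%s\t%s" % (job_id, line))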
You can also set environment variables for the Streaming processes launched by MapReduce by supplying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
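Inside the launched mapper or reducer process, the variable then appears in the ordinary process environment. A minimal sketch of reading it from Python (using os.environ.get so the script does not fail when the variable is unset, e.g. during local testing):

import os

# Returns "abracadabra" when launched with the -cmdenv option above;
# returns None if the variable is not set.
magic = os.environ.get("MAGIC_PARAMETER")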
Speculative Execution
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make the
overall job execution time smaller than it would be if the tasks ran sequentially. This
makes the job execution time sensitive to slow-running tasks, as it takes only one slow
task to make the whole job take significantly longer than it would have done otherwise.
When a job consists of hundreds or thousands of tasks, the possibility of a few straggling
tasks is very real.
Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect because the tasks still complete successfully, albeit after a longer time than expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. This is termed speculative execution of tasks.
It's important to understand that speculative execution does not work by launching two duplicate tasks at about the same time so they can race each other, which would be wasteful of cluster resources. Rather, the scheduler tracks the progress of all tasks of the same type (map and reduce) in a job, and launches speculative duplicates only for the small proportion that are running significantly slower than the average. When a task completes successfully, any duplicate tasks that are still running are killed, since they are no longer needed. So, if the original task completes before the speculative task, the speculative task is killed; on the other hand, if the speculative task finishes first, the original is killed.
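The following toy sketch illustrates the idea of speculating only on stragglers. It is not Hadoop's actual speculator; the function name, the notion of a "progress rate" per task, and the slowness_factor threshold are all assumptions made for illustration:

# Toy illustration of the speculation heuristic: duplicate only those
# tasks whose progress rate falls well below the average of their peers.
def pick_speculation_candidates(progress_rates, slowness_factor=0.5):
    """progress_rates: dict mapping task ID to progress per second
    (a hypothetical measure). Returns task IDs worth duplicating."""
    if not progress_rates:
        return []
    average = sum(progress_rates.values()) / len(progress_rates)
    threshold = average * slowness_factor
    return [task for task, rate in progress_rates.items() if rate < threshold]

# Example: t3 is a straggler and would get a speculative duplicate.
print(pick_speculation_candidates({"t1": 1.0, "t2": 0.9, "t3": 0.2}))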
Speculative execution is an optimization, not a feature to make jobs run more reliably. If there are bugs that sometimes cause a task to hang or slow down, relying on speculative execution to avoid these problems is unwise, since the same bugs are likely to affect the speculative task too.
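Speculative execution can be turned off on a per-job basis. For example, passing the following generic options to the Streaming launcher (before the Streaming-specific options) disables it for map and reduce tasks respectively; these are the Hadoop 2 property names, and both default to true:

-D mapreduce.map.speculative=false
-D mapreduce.reduce.speculative=false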