tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block. (Note that these values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) Uber tasks must be enabled explicitly (for an individual job, or across the cluster) by setting mapreduce.job.ubertask.enable to true.
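For example, uber tasks can be enabled across a cluster by adding the property to mapred-site.xml. This sketch uses the property names given above; the values shown for the two thresholds are Hadoop's defaults, included here only to make them visible:

```xml
<!-- mapred-site.xml: enable uber tasks cluster-wide.
     The threshold properties are shown with their default values. -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```

For a single job, the same property can instead be passed on the command line, e.g. -D mapreduce.job.ubertask.enable=true, provided the driver uses ToolRunner to parse generic options.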
Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output directory for the job and the temporary working space for the task output. The commit protocol is described in more detail in Output Committers.
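The effect of this setup step can be sketched in a few lines. This is a minimal plain-Java illustration, not the real Hadoop API: the class name is hypothetical, and it uses java.nio.file in place of Hadoop's FileSystem, but it mirrors what FileOutputCommitter's setupJob() does (Hadoop's temporary working area is also named _temporary):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// A minimal sketch (not the Hadoop API) of the job-setup step:
// setupJob() creates the final output directory for the job and a
// temporary working area for in-flight task output beneath it.
class JobSetupSketch {
    private final Path outputDir;

    JobSetupSketch(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Called once by the application master before any tasks run.
    void setupJob() throws IOException {
        // createDirectories also creates outputDir itself if it is missing.
        Files.createDirectories(outputDir.resolve("_temporary"));
    }
}
```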
Task Assignment
If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8). Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start (see Shuffle and Sort). Requests for reduce tasks are not made until 5% of map tasks have completed (see Reduce slow start).
Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints that the scheduler tries to honor (see Resource Requests). In the optimal case, the task is data local, that is, running on the same node that the split resides on. Alternatively, the task may be rack local: on the same rack, but not the same node, as the split. Some tasks are neither data local nor rack local and retrieve their data from a different rack than the one they are running on. For a particular job run, you can determine the number of tasks that ran at each locality level by looking at the job's counters (see Table 9-6).
Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis (subject to minimum and maximum values described in Memory settings in YARN and MapReduce) via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores, and mapreduce.reduce.cpu.vcores.
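For example, a memory-hungry job could raise the allocation for its map tasks in its job configuration. The property names are those listed above; the values chosen here are illustrative only:

```xml
<!-- Per-job settings: give each map task 2 GB of memory
     and two virtual cores instead of the 1,024 MB / 1 vcore defaults. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value>
</property>
```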