tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block. (Note that these values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) Uber tasks must be enabled explicitly (for an individual job, or across the cluster) by setting mapreduce.job.ubertask.enable to true.
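For example, here is a minimal sketch, written against the standard MapReduce Java client API, of setting these properties for an individual job. The threshold values shown (10 maps, one reduce, one 128 MB block) are only illustrative; in practice the byte limit defaults to the HDFS block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UberJobExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Enable uber tasks for this job.
            conf.setBoolean("mapreduce.job.ubertask.enable", true);
            // Optionally adjust what counts as a "small" job (illustrative values).
            conf.setInt("mapreduce.job.ubertask.maxmaps", 10);
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
            conf.setLong("mapreduce.job.ubertask.maxbytes", 128L * 1024 * 1024);

            Job job = Job.getInstance(conf, "small job");
            // ... set mapper, reducer, and input/output paths as usual ...
        }
    }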
Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output directory for the job and the temporary working space for the task output. The commit protocol is described in more detail in Output Committers.
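As a rough illustration only (not Hadoop's actual implementation), setupJob() for FileOutputCommitter amounts to something like the following sketch, which creates the job's output directory and a temporary working area beneath it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetupJobSketch {
        // Conceptual sketch: the real FileOutputCommitter also namespaces the
        // temporary directory by application attempt and handles recovery.
        public static void setupJob(Configuration conf, Path outputPath) throws Exception {
            FileSystem fs = outputPath.getFileSystem(conf);
            fs.mkdirs(outputPath);                         // final output directory for the job
            fs.mkdirs(new Path(outputPath, "_temporary")); // working space for task output
        }
    }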
Task Assignment
If the job does not qualify for running as an uber task, then the application master requests
containers for all the map and reduce tasks in the job from the resource manager (step 8).
Requests for map tasks are made first and with a higher priority than those for reduce
tasks, since all the map tasks must complete before the sort phase of the reduce can start
(see Shuffle and Sort ). Requests for reduce tasks are not made until 5% of map tasks have
completed (see Reduce slow start ).
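The 5% threshold is controlled by the mapreduce.job.reduce.slowstart.completedmaps property (0.05 by default). A minimal sketch of raising it for a single job:

    import org.apache.hadoop.conf.Configuration;

    public class ReduceSlowStartExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Delay reduce container requests until 80% of map tasks have
            // completed, instead of the 5% default.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
        }
    }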
Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints that the scheduler tries to honor (see Resource Requests). In the optimal case, the task is data local, that is, it runs on the same node that the split resides on. Alternatively, the task may be rack local: on the same rack, but not the same node, as the split. Some tasks are neither data local nor rack local and retrieve their data from a different rack than the one they are running on. For a particular job run, you can determine the number of tasks that ran at each locality level by looking at the job's counters (see Table 9-6).
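For instance, a small sketch that reads the relevant locality counters (DATA_LOCAL_MAPS, RACK_LOCAL_MAPS, and OTHER_LOCAL_MAPS from the JobCounter group) after a job has completed:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class LocalityCounters {
        public static void print(Job job) throws Exception {
            Counters counters = job.getCounters();
            System.out.println("Data-local maps:  "
                + counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
            System.out.println("Rack-local maps:  "
                + counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue());
            System.out.println("Other maps:       "
                + counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue());
        }
    }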
Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis (subject to minimum and maximum values described in Memory settings in YARN and MapReduce) via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores, and mapreduce.reduce.cpu.vcores.
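A sketch of overriding these per-task resources for one job follows; the values shown are arbitrary examples and must fall within the scheduler's configured minimum and maximum allocations.

    import org.apache.hadoop.conf.Configuration;

    public class TaskResources {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.map.memory.mb", 2048);    // 2 GB per map task
            conf.setInt("mapreduce.reduce.memory.mb", 4096); // 4 GB per reduce task
            conf.setInt("mapreduce.map.cpu.vcores", 1);
            conf.setInt("mapreduce.reduce.cpu.vcores", 2);
        }
    }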