tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
What qualifies as a small job? By default, a small job is one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block. (Note that these values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) Uber tasks must be enabled explicitly (for an individual job, or across the cluster) by setting mapreduce.job.ubertask.enable to true.
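For example, uber tasks can be enabled across a cluster by adding the property to mapred-site.xml. This sketch uses the property names given above; the values shown for the two thresholds are Hadoop's defaults, included here only to make them visible:

```xml
<!-- mapred-site.xml: enable uber tasks cluster-wide.
     The threshold properties are shown with their default values. -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```

For a single job, the same property can instead be passed on the command line, e.g. -D mapreduce.job.ubertask.enable=true, provided the driver uses ToolRunner to parse generic options.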
Finally, before any tasks can be run, the application master calls the setupJob() method on the OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output directory for the job and the temporary working space for the task output. The commit protocol is described in more detail in Output Committers.
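The effect of this setup step can be sketched in a few lines. This is a minimal plain-Java illustration, not the real Hadoop API: the class name is hypothetical, and it uses java.nio.file in place of Hadoop's FileSystem, but it mirrors what FileOutputCommitter's setupJob() does (Hadoop's temporary working area is also named _temporary):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// A minimal sketch (not the Hadoop API) of the job-setup step:
// setupJob() creates the final output directory for the job and a
// temporary working area for in-flight task output beneath it.
class JobSetupSketch {
    private final Path outputDir;

    JobSetupSketch(Path outputDir) {
        this.outputDir = outputDir;
    }

    // Called once by the application master before any tasks run.
    void setupJob() throws IOException {
        // createDirectories also creates outputDir itself if it is missing.
        Files.createDirectories(outputDir.resolve("_temporary"));
    }
}
```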
Task Assignment
If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8). Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start (see Shuffle and Sort). Requests for reduce tasks are not made until 5% of map tasks have completed (see Reduce slow start).
Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality constraints that the scheduler tries to honor (see Resource Requests). In the optimal case, the task is data local, that is, running on the same node that the split resides on. Alternatively, the task may be rack local: on the same rack, but not the same node, as the split. Some tasks are neither data local nor rack local and retrieve their data from a different rack than the one they are running on. For a particular job run, you can determine the number of tasks that ran at each locality level by looking at the job's counters (see Table 9-6).
Requests also specify memory requirements and CPUs for tasks. By default, each map and reduce task is allocated 1,024 MB of memory and one virtual core. The values are configurable on a per-job basis (subject to minimum and maximum values described in Memory settings in YARN and MapReduce) via the following properties: mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores, and mapreduce.reduce.cpu.vcores.
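For example, a memory-hungry job could raise the allocation for its map tasks in its job configuration. The property names are those listed above; the values chosen here are illustrative only:

```xml
<!-- Per-job settings: give each map task 2 GB of memory
     and two virtual cores instead of the 1,024 MB / 1 vcore defaults. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value>
</property>
```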