If the job completed successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following (a client-side sketch of driving these steps through the Job API follows the list):
▪ Asks the resource manager for a new application ID, used for the MapReduce job
ID (step 2).
▪ Checks the output specification of the job. For example, if the output directory
has not been specified or it already exists, the job is not submitted and an error is
thrown to the MapReduce program.
▪ Computes the input splits for the job. If the splits cannot be computed (because
the input paths don't exist, for example), the job is not submitted and an error is
thrown to the MapReduce program.
▪ Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID (step 3). The job JAR is copied with a high replication factor (controlled by the mapreduce.client.submit.file.replication property, which defaults to 10) so that there are lots of copies across the cluster for the node managers to access when they run tasks for the job.
▪ Submits the job by calling submitApplication() on the resource manager
(step 4).
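The sketch below shows the client-side view that triggers these steps. The class name, job name, and the /input and /output paths are placeholders, and the mapper/reducer setup is omitted (Hadoop's identity defaults apply); the Job, FileInputFormat, and FileOutputFormat calls and the mapreduce.client.submit.file.replication property are standard Hadoop APIs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Replication factor used for the submitted job resources
    // (job JAR, configuration, split metadata); 10 is the default.
    conf.setInt("mapreduce.client.submit.file.replication", 10);

    Job job = Job.getInstance(conf, "submit example");
    job.setJarByClass(SubmitExample.class);                   // JAR copied to the shared filesystem
    FileInputFormat.addInputPath(job, new Path("/input"));    // used to compute the input splits
    FileOutputFormat.setOutputPath(job, new Path("/output")); // must not already exist

    // waitForCompletion() drives JobSubmitter: it checks the output
    // specification, computes the splits, copies the job resources, calls
    // submitApplication() on the resource manager, and then polls for progress.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}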
Job Initialization
When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler. The scheduler allocates a container, and the
resource manager then launches the application master's process there, under the node
manager's management (steps 5a and 5b).
The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks (step 6). Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point.
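The map task count therefore follows the number of splits, while the reduce count must be set explicitly, in code or in configuration. A minimal sketch of the two equivalent ways to set it (the class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "reduce count example");

    // The application master creates one reduce task object per unit of this
    // count; the map task count comes from the input splits and is not set here.
    job.setNumReduceTasks(4); // writes mapreduce.job.reduces = 4

    // Equivalent to setting the property directly, e.g. -D mapreduce.job.reduces=4
    System.out.println(job.getConfiguration().get("mapreduce.job.reduces")); // prints 4
  }
}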
The application master must decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task.
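What counts as "small" is governed by configuration. A brief sketch of the relevant Hadoop properties, assuming the shipped defaults (uber mode is off unless explicitly enabled; the class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberTaskExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Allow small jobs to run in the application master's JVM ("uber" mode).
    conf.setBoolean("mapreduce.job.ubertask.enable", true);

    // Thresholds for what qualifies as small (shown with their default values):
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);    // at most 9 map tasks
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1); // at most 1 reduce task
    // mapreduce.job.ubertask.maxbytes defaults to the filesystem block size.

    Job job = Job.getInstance(conf, "uber example");
    // ... set the mapper, reducer, and input/output paths as usual ...
  }
}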