HDFS and MapReduce - Cloudera Administration

Learning all about the MapReduce job flow

Several operations and services are involved in submitting and executing a MapReduce job in a Hadoop cluster.

The two main services that are responsible for job execution are:

• Jobtracker

• Tasktracker

When a client submits a job to the cluster, the jobtracker creates a new job ID and returns it to the client. Once the client has the ID, the job resources, along with information on the input splits of the data, are copied to HDFS so that all the services in the cluster can access them. The client then polls the jobtracker every second to check the job's completion status.
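
The submission steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: `FakeJobTracker`, `submit_and_wait`, and the dictionary standing in for HDFS are hypothetical names invented for this sketch and are not part of any Hadoop API.

```python
import itertools


class FakeJobTracker:
    """Hypothetical stand-in for the jobtracker; not a Hadoop class."""

    def __init__(self, polls_until_done):
        self._ids = itertools.count(1)
        self._remaining = {}
        self._polls_until_done = polls_until_done

    def new_job_id(self):
        # The jobtracker creates the job ID and returns it to the client.
        job_id = f"job_{next(self._ids):04d}"
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def is_complete(self, job_id):
        # Each status check moves the simulated job one step closer to done.
        self._remaining[job_id] -= 1
        return self._remaining[job_id] <= 0


def submit_and_wait(tracker, shared_fs, resources, splits, sleep=lambda s: None):
    """Request an ID, publish the job files, then poll until completion."""
    job_id = tracker.new_job_id()
    # Copy the job resources and input split information to shared storage
    # (a plain dict plays the role of HDFS in this sketch).
    shared_fs[job_id] = {"resources": resources, "splits": splits}
    polls = 0
    while True:
        polls += 1
        if tracker.is_complete(job_id):
            return job_id, polls
        sleep(1)  # poll once per second, as described in the text
```

Passing `time.sleep` as the `sleep` argument would give the one-second polling interval; the default does nothing so that the sketch runs instantly.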

The jobtracker then takes over and initializes the job in the cluster by accessing the job resources in HDFS. It retrieves the input split information and decides which tasks to assign to the tasktrackers. The jobtracker creates one map task for each input split and assigns the map tasks to the tasktrackers. The tasktrackers are also responsible for running the reduce tasks once the map tasks complete. The jobtracker tries to assign each map task to a tasktracker on a node close to the data that the task will read. This data locality improves performance by limiting the amount of data transferred across the network.
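
To make the idea of data locality concrete, here is a minimal sketch of a locality-first assignment. It assumes a simplified model in which all splits are assigned in one batch and a node is either local to a split or not; the function name `assign_map_tasks` and its data shapes are hypothetical, and the greedy strategy shown is not claimed to be the jobtracker's actual scheduling algorithm.

```python
def assign_map_tasks(splits, free_slots):
    """Assign one map task per input split, preferring data-local hosts.

    splits:     dict of split id -> list of hosts that store the split
    free_slots: dict of host -> number of free map slots
    Returns a dict of split id -> (host, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    pending = []

    # First pass: place each split on a host that already holds its data.
    for split_id, hosts in splits.items():
        local = next((h for h in hosts if slots.get(h, 0) > 0), None)
        if local is not None:
            slots[local] -= 1
            assignment[split_id] = (local, True)
        else:
            pending.append(split_id)

    # Second pass: fall back to any host with a free slot. The data for
    # these tasks has to travel across the network.
    for split_id in pending:
        host = next((h for h, n in slots.items() if n > 0), None)
        if host is None:
            break  # no capacity left; the remaining splits have to wait
        slots[host] -= 1
        assignment[split_id] = (host, False)

    return assignment
```

Because the first pass is greedy, it can miss a better overall placement; it is meant only to show why a local slot is preferred over a remote one.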

The tasktracker is the service that actually runs a task. Tasktrackers run continuously, waiting for the jobtracker to assign tasks to them. Each tasktracker is configured to run a specific number of map and reduce tasks; these capacities are called slots.
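
Slot accounting can be pictured with the following sketch. The class `SlotPool` is a hypothetical illustration rather than Hadoop code. In Hadoop 1.x the per-tasktracker limits are set through the `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` properties, which is why the sketch defaults both values to 2.

```python
class SlotPool:
    """Hypothetical model of the map and reduce slots on one tasktracker."""

    def __init__(self, map_slots=2, reduce_slots=2):
        self.capacity = {"map": map_slots, "reduce": reduce_slots}
        self.running = {"map": 0, "reduce": 0}

    def free(self, kind):
        # Number of additional tasks of this kind that could start now.
        return self.capacity[kind] - self.running[kind]

    def start(self, kind):
        # A task can start only if a slot of the matching kind is free.
        if self.free(kind) == 0:
            return False
        self.running[kind] += 1
        return True

    def finish(self, kind):
        # Finishing a task releases its slot for the next assignment.
        if self.running[kind] == 0:
            raise ValueError(f"no running {kind} task to finish")
        self.running[kind] -= 1
```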

The tasktracker sends a periodic heartbeat to the jobtracker to report that it is alive and how many map and reduce slots it has available. The jobtracker assigns tasks in its response to the heartbeat. Once a task is assigned, the tasktracker copies the client program (usually a set of compiled Java classes packaged as a JAR file) from HDFS to its local storage. All the intermediate data generated by a map task is stored locally on the node where the tasktracker runs.
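
The heartbeat exchange described above can be sketched as a single method call whose return value carries the new assignments. `JobTrackerSketch` and its fields are hypothetical names for this illustration. As a simplifying assumption, the sketch hands out reduce tasks as soon as no map task is still waiting to be assigned, rather than tracking when each map task actually finishes.

```python
from collections import deque


class JobTrackerSketch:
    """Hypothetical model of task assignment driven by heartbeats."""

    def __init__(self, map_tasks, reduce_tasks):
        self.pending_maps = deque(map_tasks)
        self.pending_reduces = deque(reduce_tasks)
        self.last_seen = {}  # tasktracker name -> time of last heartbeat

    def heartbeat(self, tracker, free_map_slots, free_reduce_slots, now):
        # Receiving the heartbeat is the evidence that the tasktracker is alive.
        self.last_seen[tracker] = now
        assigned = []

        # Fill the reported free map slots with pending map tasks.
        while free_map_slots > 0 and self.pending_maps:
            assigned.append(self.pending_maps.popleft())
            free_map_slots -= 1

        # Simplification: reduces go out only once no map task is pending.
        if not self.pending_maps:
            while free_reduce_slots > 0 and self.pending_reduces:
                assigned.append(self.pending_reduces.popleft())
                free_reduce_slots -= 1

        # The assignments travel back as the response to the heartbeat.
        return assigned
```

A tasktracker in this model would call `heartbeat` on a timer and start whatever tasks come back; an empty list means there is nothing to do until the next heartbeat.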

After all the map and reduce tasks are completed, the jobtracker receives a notification of completion and marks the job as successful. The client, which has been polling for the job's status, prints the completion notification on its console.
