Learning all about the MapReduce job flow
There are several operations and services involved in the submission and execution of a
MapReduce job in a Hadoop cluster.
The two main services that are responsible for job execution are:
• Jobtracker
• Tasktracker
When a client submits a job to the cluster, the jobtracker creates a new job ID and
returns it to the client. After receiving the ID, the client copies the job resources,
along with the information on the input splits of the data, to HDFS so that all the
services in the cluster can access them. The client then polls the jobtracker every
second to check the job's completion status.
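
The submission steps above can be sketched as a small simulation. This is not Hadoop client code: `ToyJobTracker`, `run_job`, and the dictionary standing in for HDFS are hypothetical names made up for illustration, and the toy jobtracker simply reports completion after a fixed number of status checks.

```python
import itertools
import time


class ToyJobTracker:
    """Hypothetical in-memory stand-in for the jobtracker (not a Hadoop API)."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._remaining = {}  # job ID -> status checks left before completion
        self._polls_until_done = polls_until_done

    def new_job_id(self):
        # The jobtracker creates the ID and hands it back to the client.
        return "job_{:04d}".format(next(self._ids))

    def submit(self, job_id):
        self._remaining[job_id] = self._polls_until_done

    def is_complete(self, job_id):
        # Assumption: the job finishes after a fixed number of status checks.
        self._remaining[job_id] -= 1
        return self._remaining[job_id] <= 0


def run_job(tracker, shared_fs, jar, splits, poll_interval=1.0):
    """Get a job ID, stage the resources, then poll until the job completes."""
    job_id = tracker.new_job_id()
    # Stage the job resources and split information where every service can read them.
    shared_fs[job_id] = {"jar": jar, "splits": list(splits)}
    tracker.submit(job_id)
    polls = 0
    while True:
        polls += 1
        if tracker.is_complete(job_id):
            return job_id, polls
        time.sleep(poll_interval)  # the text describes a one-second interval
```

Calling `run_job(ToyJobTracker(), {}, "wordcount.jar", ["split-0", "split-1"])` would walk through the same order of events as the paragraph: ID first, resources second, polling last.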
The jobtracker then takes over and initializes the job in the cluster by accessing the job
resources in HDFS. It retrieves the input split information and decides which tasks to
assign to the tasktrackers. The jobtracker creates one map task for each input split and
assigns the map tasks to the tasktrackers. The tasktrackers also run the reduce tasks once
the map tasks complete. The jobtracker tries to assign map tasks to tasktrackers on nodes
that are close to the data (data locality). This improves performance by limiting the data
transferred across the network.
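
A minimal sketch of the idea, assuming each input split is described by the list of hosts that hold its data. The functions `create_map_tasks` and `pick_task` are hypothetical and only distinguish "data on this node" from "data elsewhere"; a real scheduler weighs proximity in more detail.

```python
def create_map_tasks(splits):
    """Create one map task per input split.

    `splits` maps a split name to the hosts that store its data.
    """
    return [
        {"task": "map_{}".format(i), "split": name, "hosts": set(hosts)}
        for i, (name, hosts) in enumerate(sorted(splits.items()))
    ]


def pick_task(pending, tracker_host):
    """Return (task, is_local) for a tasktracker running on tracker_host.

    A task whose data lives on tracker_host is preferred; otherwise the first
    pending task is handed out and its data must cross the network.
    """
    for task in pending:
        if tracker_host in task["hosts"]:
            pending.remove(task)
            return task, True
    if pending:
        return pending.pop(0), False
    return None, False
```

With two splits stored on different nodes, a tasktracker on the node holding the second split receives that split first, even though it is not at the head of the pending list.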
The tasktracker is the service that actually runs a task. Tasktrackers run continuously,
waiting for the jobtracker to assign tasks to them. Each tasktracker is configured to run
a specific number of map and reduce tasks at a time; these are called slots.
The tasktracker sends a periodic heartbeat to the jobtracker to signal that it is alive and
to report the number of map and reduce slots it has available. The jobtracker assigns tasks
in its response to the heartbeat. Once a task is assigned, the tasktracker copies the client
program (usually a set of compiled Java classes packaged as a jar) from HDFS to its local
disk. All the intermediate data generated by a map task is stored locally on the node
where the tasktracker runs.
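
The heartbeat exchange can be illustrated with a toy model in which the return value of the heartbeat call carries the assigned tasks. `HeartbeatJobTracker` and its `heartbeat` method are hypothetical names, not part of Hadoop, and the sketch deliberately ignores the ordering between the map and reduce phases.

```python
class HeartbeatJobTracker:
    """Hypothetical jobtracker that hands out tasks in its heartbeat response."""

    def __init__(self, map_tasks, reduce_tasks):
        self.pending = {"map": list(map_tasks), "reduce": list(reduce_tasks)}
        self.last_seen = {}  # tasktracker name -> tick of its latest heartbeat

    def heartbeat(self, tracker_name, free_slots, tick):
        """Record that the tasktracker is alive and return the tasks assigned to it.

        `free_slots` is a dict such as {"map": 2, "reduce": 1}.
        """
        self.last_seen[tracker_name] = tick
        assigned = []
        for kind in ("map", "reduce"):
            # Never assign more tasks of a kind than the tasktracker has free slots.
            count = min(free_slots.get(kind, 0), len(self.pending[kind]))
            for _ in range(count):
                assigned.append(self.pending[kind].pop(0))
        return assigned
```

A tasktracker that reports no free slots still has its heartbeat recorded but gets an empty list back, which matches the idea that the heartbeat serves both as a liveness signal and as a request for work.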
After all the map and reduce tasks are completed, the jobtracker receives a notification of
completion and marks the job as successful. The client, which has been polling for the
job's status, then prints the completion notification on its console.
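
As a rough illustration of the final step, the bookkeeping can be modelled as a set of outstanding tasks that shrinks as completions are reported. The `JobStatus` class and its state names are assumptions made for this sketch, not Hadoop identifiers, and failed tasks are not modelled.

```python
class JobStatus:
    """Hypothetical tracker for one job's outstanding map and reduce tasks."""

    def __init__(self, task_ids):
        self.outstanding = set(task_ids)
        self.state = "RUNNING"

    def task_completed(self, task_id):
        """Record a finished task; mark the job successful once none remain."""
        self.outstanding.discard(task_id)
        if not self.outstanding:
            self.state = "SUCCEEDED"
        return self.state
```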