machine's storage to HDFS. In summary, the job of the DataNode is to
manage all the I/O (that is, read and write requests).
HDFS is also the point of integration for a new Microsoft technology called
Polybase, which you will learn more about in Chapter 10, “Data Warehouses
and Hadoop Integration.”
MapReduce
MapReduce is both an engine and a programming model. Users develop
MapReduce programs and submit them to the MapReduce engine for
processing. The programs created by the developers are known as jobs. Each
job is a combination of Java ARchive (JAR) files and classes required to
execute the MapReduce program. These files are themselves collated into a
single JAR file known as a job file.
Each MapReduce job can be broken down into a few key components. The
first phase of the job is the map. The map breaks the input up into many
tiny pieces so that it can then process each piece independently and in
parallel. Once complete, the results from this initial processing can be
collected, aggregated, and processed. This is the reduce part of the job.
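The two phases can be illustrated with a minimal word-count sketch in plain Python. This is not the Hadoop Java API; the function names and the in-memory "shuffle" step are purely illustrative of the model described above:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input split is turned into (key, value) pairs
# independently, so splits could run in parallel on different nodes.
def map_phase(split):
    return [(word, 1) for word in split.split()]

# Shuffle: group the intermediate pairs by key before reduction.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values collected for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big cluster", "big data"]
pairs = chain.from_iterable(map_phase(s) for s in splits)
result = reduce_phase(shuffle(pairs))
# result == {"big": 3, "data": 2, "cluster": 1}
```

In a real cluster, the map and reduce functions run on different nodes and the shuffle moves data across the network, but the logical flow is the same.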
The MapReduce engine is used to distribute the workload across the HDFS
cluster and is responsible for the execution of MapReduce jobs. The
MapReduce engine accepts jobs via the JobTracker. There is one JobTracker
per Hadoop cluster (the impact of which we discuss shortly). The
JobTracker provides the scheduling and orchestration of the MapReduce
engine; it does not actually process data itself.
To execute a job, the JobTracker communicates with the HDFS NameNode
to determine the location of the data to be analyzed. Once the location
is known, the JobTracker then speaks to another component of the
MapReduce engine called the TaskTracker. There are actually many
TaskTracker nodes in the Hadoop cluster. Each node of the cluster has its
own TaskTracker. Clearly then, the MapReduce engine is another master/
slave architecture.
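The division of labor described above can be sketched as a toy scheduler. This is not Hadoop code; the node names and data structures are invented for illustration. The point is the data-locality preference: the JobTracker, having learned from the NameNode where the block lives, tries to place the task on the TaskTracker co-located with that data before falling back to any tracker with capacity:

```python
# Toy model of JobTracker scheduling. trackers maps a node name to
# the number of free task slots on that node's TaskTracker.
def assign_task(block_location, trackers):
    # Prefer the node that already stores the data block (data locality).
    if trackers.get(block_location, 0) > 0:
        trackers[block_location] -= 1
        return block_location
    # Otherwise fall back to any node with a free slot.
    for node, slots in trackers.items():
        if slots > 0:
            trackers[node] -= 1
            return node
    return None  # no capacity anywhere; the task must wait

trackers = {"node1": 1, "node2": 2}
print(assign_task("node2", trackers))  # node2 (data-local)
print(assign_task("node2", trackers))  # node2 (still has a free slot)
print(assign_task("node2", trackers))  # node1 (node2 is now full)
```

A real JobTracker also tracks TaskTracker heartbeats, failed tasks, and rack topology, but the locality-first placement shown here is the core scheduling idea.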
TaskTrackers provide the execution engine for the MapReduce engine by
spawning a separate process for every task request. Therefore, the
JobTracker must identify the appropriate TaskTrackers to use by assessing
which are available to accept task requests and, ideally, which trackers are