Scaling Out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye
view of the system and look at the data flow for large inputs. For simplicity, the examples
so far have used files on the local filesystem. However, to scale out, we need to store the
data in a distributed filesystem (typically HDFS, which you'll learn about in the next
chapter). This allows Hadoop to move the MapReduce computation to each machine hosting a part of the data, using Hadoop's resource management system, called YARN (see Chapter 4). Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. The tasks are scheduled using YARN and run on nodes in the cluster. If a
task fails, it will be automatically rescheduled to run on a different node.
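To make these three ingredients concrete, here is a minimal driver sketch written against the standard org.apache.hadoop.mapreduce API. MyMapper and MyReducer are hypothetical placeholders for whatever mapper and reducer your program defines (a possible MyMapper appears in the next sketch), and the input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A job bundles the three ingredients: input data (the paths), the MapReduce
// program (the mapper and reducer classes), and configuration information.
// MyMapper and MyReducer are placeholders for classes you would write.
public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "my job");
    job.setJarByClass(MyJobDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // where results go

    job.setMapperClass(MyMapper.class);                     // the MapReduce program
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
  }
}

When waitForCompletion() is called, the client submits the job, and YARN takes over scheduling its map and reduce tasks across the cluster.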
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map
function for each record in the split.
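As an illustration of what that user-defined map function might look like, here is a sketch of the hypothetical MyMapper referred to in the driver above. It assumes the default line-oriented TextInputFormat, so each record in the split is one line of text keyed by its byte offset in the file, and it simply emits every word with a count of 1:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The framework calls map() once for each record in the map task's split.
// With TextInputFormat, a record is one line, keyed by its offset in the file.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // emit (word, 1) for the reduce phase
      }
    }
  }
}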
Having many splits means the time taken to process each split is small compared to the
time to process the whole input. So if we are processing the splits in parallel, the processing
is better load balanced when the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if
the machines are identical, failed processes or other jobs running concurrently make load
balancing desirable, and the quality of the load balancing increases as the splits become
more fine grained.
On the other hand, if splits are too small, the overhead of managing the splits and map task
creation begins to dominate the total job execution time. For most jobs, a good split size
tends to be the size of an HDFS block, which is 128 MB by default, although this can be
changed for the cluster (for all newly created files) or specified when each file is created.
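If you do need to override the split size for a particular job, FileInputFormat exposes minimum and maximum split-size settings. The sketch below is illustrative only; the 64 MB figure is arbitrary, and the property names shown are the standard MapReduce 2 ones:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  // FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)),
  // so by default it works out to the HDFS block size. Capping the maximum at
  // 64 MB (illustrative only) yields more, smaller splits than the default.
  public static void configure(Job job) {
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    // Equivalent configuration properties:
    //   mapreduce.input.fileinputformat.split.maxsize (and .minsize)
    // The HDFS block size itself is controlled by dfs.blocksize.
  }
}

For most jobs, though, the default block-sized splits are the right choice, precisely because of the data locality discussed next.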
Hadoop does its best to run the map task on a node where the input data resides in HDFS, because doing so doesn't use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task's input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer. The three possibilities are illustrated in Figure 2-2.