Scaling Out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye
view of the system and look at the data flow for large inputs. For simplicity, the examples
so far have used files on the local filesystem. However, to scale out, we need to store the
data in a distributed filesystem (typically HDFS, which you'll learn about in the next
chapter). This allows Hadoop to move the MapReduce computation to each machine hosting a part of the data, using Hadoop's resource management system, called YARN (see Chapter 4). Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. The tasks are scheduled using YARN and run on nodes in the cluster. If a
task fails, it will be automatically rescheduled to run on a different node.
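To make these three ingredients concrete, here is a minimal driver sketch written against the standard org.apache.hadoop.mapreduce API. MyMapper and MyReducer are hypothetical placeholders for whatever mapper and reducer your program defines (a possible MyMapper appears in the next sketch), and the input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A job bundles the three ingredients: input data (the paths), the MapReduce
// program (the mapper and reducer classes), and configuration information.
// MyMapper and MyReducer are placeholders for classes you would write.
public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "my job");
    job.setJarByClass(MyJobDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // the input data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // where results go

    job.setMapperClass(MyMapper.class);                     // the MapReduce program
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
  }
}

When waitForCompletion() is called, the client submits the job, and YARN takes over scheduling its map and reduce tasks across the cluster.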
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map
function for each record in the split.
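As an illustration of what that user-defined map function might look like, here is a sketch of the hypothetical MyMapper referred to in the driver above. It assumes the default line-oriented TextInputFormat, so each record in the split is one line of text keyed by its byte offset in the file, and it simply emits every word with a count of 1:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The framework calls map() once for each record in the map task's split.
// With TextInputFormat, a record is one line, keyed by its offset in the file.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // emit (word, 1) for the reduce phase
      }
    }
  }
}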
Having many splits means the time taken to process each split is small compared to the
time to process the whole input. So if we are processing the splits in parallel, the processing
is better load balanced when the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if
the machines are identical, failed processes or other jobs running concurrently make load
balancing desirable, and the quality of the load balancing increases as the splits become
more fine grained.
On the other hand, if splits are too small, the overhead of managing the splits and map task
creation begins to dominate the total job execution time. For most jobs, a good split size
tends to be the size of an HDFS block, which is 128 MB by default, although this can be
changed for the cluster (for all newly created files) or specified when each file is created.
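If you do need to override the split size for a particular job, FileInputFormat exposes minimum and maximum split-size settings. The sketch below is illustrative only; the 64 MB figure is arbitrary, and the property names shown are the standard MapReduce 2 ones:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  // FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)),
  // so by default it works out to the HDFS block size. Capping the maximum at
  // 64 MB (illustrative only) yields more, smaller splits than the default.
  public static void configure(Job job) {
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    // Equivalent configuration properties:
    //   mapreduce.input.fileinputformat.split.maxsize (and .minsize)
    // The HDFS block size itself is controlled by dfs.blocksize.
  }
}

For most jobs, though, the default block-sized splits are the right choice, precisely because of the data locality discussed next.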
Hadoop does its best to run the map task on a node where the input data resides in HDFS, because doing so doesn't use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task's input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer. The three possibilities are illustrated in Figure 2-2.