Applications can write output files directly from the map or reduce task to a distributed filesystem, such as HDFS. (There are other ways to produce multiple outputs, too, as described in Multiple Outputs.)
Care needs to be taken to ensure that multiple instances of the same task don't try to write
to the same file. As we saw in the previous section, the OutputCommitter protocol
solves this problem. If applications write side files in their tasks' working directories, the
side files for tasks that successfully complete will be promoted to the output directory
automatically, whereas failed tasks will have their side files deleted.
A task may find its working directory by retrieving the value of the mapreduce.task.output.dir property from the job configuration. Alternatively, a MapReduce program using the Java API may call the getWorkOutputPath() static method on FileOutputFormat to get the Path object representing the working directory. The framework creates the working directory before executing the task, so you don't need to create it.
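For example, a task can discover its working directory in either of these ways (a minimal sketch against the new MapReduce API; the mapper class name here is illustrative):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // From the job configuration...
    String dirProperty = context.getConfiguration()
        .get("mapreduce.task.output.dir");

    // ...or via the static helper on FileOutputFormat.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
  }
}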
To take a simple example, imagine a program for converting image files from one format to another. One way to do this is to have a map-only job, where each map is given a set of images to convert (perhaps using NLineInputFormat; see NLineInputFormat). If a map task writes the converted images into its working directory, they will be promoted to the output directory when the task completes successfully.
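A sketch of such a mapper follows, again assuming the new MapReduce API; the convert() helper standing in for the actual image transcoding is hypothetical:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Each input line names an image to convert (as handed out by
// NLineInputFormat). The converted image is written as a side file into
// the task's working directory; on success, the OutputCommitter promotes
// it to the job's output directory.
public class ImageConversionMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Path src = new Path(value.toString());
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    Path dst = new Path(workDir, src.getName() + ".converted");

    FileSystem fs = dst.getFileSystem(context.getConfiguration());
    try (FSDataOutputStream out = fs.create(dst)) {
      convert(src, out, context); // hypothetical transcoding helper
    }
    // No key-value pairs are emitted; the side file is the result.
  }

  private void convert(Path src, FSDataOutputStream out, Context context)
      throws IOException {
    // Placeholder for the real format-conversion logic.
  }
}

The driver for such a job would set the number of reduce tasks to zero (job.setNumReduceTasks(0)) to make it map-only.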
[51] In the old MapReduce API, you can call JobClient.submitJob(conf) or JobClient.runJob(conf).
[52] Not discussed in this section are the job history server daemon (for retaining job history data) and the shuffle handler auxiliary service (for serving map outputs to reduce tasks).
[53] If a Streaming process hangs, the node manager will kill it (along with the JVM that launched it) only in the following circumstances: either yarn.nodemanager.container-executor.class is set to org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor, or the default container executor is being used and the setsid command is available on the system (so that the task JVM and any processes it launches are in the same process group). In any other case, orphaned Streaming processes will accumulate on the system, which will impact utilization over time.
[54] The term shuffle is actually imprecise, since in some contexts it refers to only the part of the process where map outputs are fetched by reduce tasks. In this section, we take it to mean the whole process, from the point where a map produces output to where a reduce consumes input.