Applications can write output files directly from the map or reduce task to a distributed filesystem, such as HDFS. (There are other ways to produce multiple outputs, too, as described in Multiple Outputs.)
Care needs to be taken to ensure that multiple instances of the same task don't try to write
to the same file. As we saw in the previous section, the OutputCommitter protocol
solves this problem. If applications write side files in their tasks' working directories, the
side files for tasks that successfully complete will be promoted to the output directory
automatically, whereas failed tasks will have their side files deleted.
A task may find its working directory by retrieving the value of the mapreduce.task.output.dir property from the job configuration. Alternatively, a MapReduce program using the Java API may call the getWorkOutputPath() static method on FileOutputFormat to get the Path object representing the working directory. The framework creates the working directory before executing the task, so you don't need to create it.
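For example, a task can discover its working directory in either of these ways (a minimal sketch against the new MapReduce API; the mapper class name here is illustrative):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    // From the job configuration...
    String dirProperty = context.getConfiguration()
        .get("mapreduce.task.output.dir");

    // ...or via the static helper on FileOutputFormat.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
  }
}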
To take a simple example, imagine a program for converting image files from one format to another. One way to do this is to have a map-only job, where each map is given a set of images to convert (perhaps using NLineInputFormat; see NLineInputFormat). If a map task writes the converted images into its working directory, they will be promoted to the output directory when the task completes successfully.
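A sketch of such a mapper follows, again assuming the new MapReduce API; the convert() helper standing in for the actual image transcoding is hypothetical:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Each input line names an image to convert (as handed out by
// NLineInputFormat). The converted image is written as a side file into
// the task's working directory; on success, the OutputCommitter promotes
// it to the job's output directory.
public class ImageConversionMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Path src = new Path(value.toString());
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    Path dst = new Path(workDir, src.getName() + ".converted");

    FileSystem fs = dst.getFileSystem(context.getConfiguration());
    try (FSDataOutputStream out = fs.create(dst)) {
      convert(src, out, context); // hypothetical transcoding helper
    }
    // No key-value pairs are emitted; the side file is the result.
  }

  private void convert(Path src, FSDataOutputStream out, Context context)
      throws IOException {
    // Placeholder for the real format-conversion logic.
  }
}

The driver for such a job would set the number of reduce tasks to zero (job.setNumReduceTasks(0)) to make it map-only.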
[51] In the old MapReduce API, you can call JobClient.submitJob(conf) or JobClient.runJob(conf).
[52] Not discussed in this section are the job history server daemon (for retaining job history data) and the shuffle handler auxiliary service (for serving map outputs to reduce tasks).
[53] If a Streaming process hangs, the node manager will kill it (along with the JVM that launched it) only in the following circumstances: either yarn.nodemanager.container-executor.class is set to org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor, or the default container executor is being used and the setsid command is available on the system (so that the task JVM and any processes it launches are in the same process group). In any other case, orphaned Streaming processes will accumulate on the system, which will impact utilization over time.
[54] The term shuffle is actually imprecise, since in some contexts it refers to only the part of the process where map outputs are fetched by reduce tasks. In this section, we take it to mean the whole process, from the point where a map produces output to where a reduce consumes input.