The setupJob() method is called before the job is run, and is typically used to perform initialization. For FileOutputCommitter, the method creates the final output directory, ${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space for task output, _temporary, as a subdirectory underneath it.
If the job succeeds, the commitJob() method is called, which in the default file-based implementation deletes the temporary working space and creates a hidden empty marker file in the output directory called _SUCCESS to indicate to filesystem clients that the job completed successfully. If the job did not succeed, abortJob() is called with a state object indicating whether the job failed or was killed (by a user, for example). In the default implementation, this will delete the job's temporary working space.
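The job-level protocol can be sketched on the local filesystem with java.nio.file. This is an illustrative simulation, not the actual FileOutputCommitter code (which works against a Hadoop FileSystem); only the directory and marker-file names come from the text, and the class and method names here are hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative sketch of the job-level commit protocol described above.
class JobCommitSketch {
    private final Path outputDir; // stands in for ${mapreduce.output.fileoutputformat.outputdir}
    private final Path tempDir;   // the _temporary working space underneath it

    JobCommitSketch(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    // setupJob(): create the final output directory and the temporary
    // working space as a subdirectory underneath it.
    void setupJob() throws IOException {
        Files.createDirectories(tempDir);
    }

    // commitJob(): delete the temporary working space and drop the empty
    // _SUCCESS marker so filesystem clients can see the job finished cleanly.
    void commitJob() throws IOException {
        deleteRecursively(tempDir);
        Files.createFile(outputDir.resolve("_SUCCESS"));
    }

    // abortJob(): on failure or kill, just remove the temporary working space.
    void abortJob() throws IOException {
        deleteRecursively(tempDir);
    }

    private static void deleteRecursively(Path p) throws IOException {
        if (!Files.exists(p)) return;
        try (Stream<Path> walk = Files.walk(p)) {
            walk.sorted(Comparator.reverseOrder()).forEach(q -> {
                try {
                    Files.delete(q);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

The key design point mirrored here is that all intermediate state lives under one _temporary directory, so both commit and abort reduce to a single recursive delete.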
The operations are similar at the task level. The setupTask() method is called before
the task is run, and the default implementation doesn't do anything, because temporary
directories named for task outputs are created when the task outputs are written.
The commit phase for tasks is optional and may be disabled by returning false from needsTaskCommit(). This saves the framework from having to run the distributed commit protocol for the task, and neither commitTask() nor abortTask() is called. FileOutputCommitter will skip the commit phase when no output has been written by a task.
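The needsTaskCommit() decision can be sketched as a simple existence check: commit only when the attempt actually produced output. This is a minimal sketch of the rule the text attributes to FileOutputCommitter; the class name and the idea of passing the attempt directory as a Path are assumptions for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative needsTaskCommit(): skip the distributed commit protocol
// entirely when the task attempt wrote no output.
class TaskCommitCheck {
    // attemptDir stands in for the attempt's temporary output directory,
    // which only comes into existence once the task writes output.
    static boolean needsTaskCommit(Path attemptDir) {
        return Files.exists(attemptDir);
    }
}
```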
If a task succeeds, commitTask() is called, which in the default implementation moves the temporary task output directory (which has the task attempt ID in its name to avoid conflicts between task attempts) to the final output path, ${mapreduce.output.fileoutputformat.outputdir}. Otherwise, the framework calls abortTask(), which deletes the temporary task output directory.
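The task-level commit and abort can be sketched the same way, again on the local filesystem rather than via the real Hadoop API; the names below are hypothetical, and the sketch assumes the attempt directory holds a flat set of output files:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative commitTask()/abortTask(): the attempt directory is named
// after the task attempt ID so concurrent attempts cannot collide, and
// committing amounts to moving its contents into the final output path.
class TaskCommitSketch {
    // commitTask(): promote the attempt's files into the final output
    // directory, then remove the now-empty attempt directory.
    static void commitTask(Path attemptDir, Path outputDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.move(f, outputDir.resolve(f.getFileName()));
            }
        }
        Files.delete(attemptDir);
    }

    // abortTask(): discard the attempt's temporary output wholesale
    // (flat files only in this sketch).
    static void abortTask(Path attemptDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.delete(f);
            }
        }
        Files.delete(attemptDir);
    }
}
```

Because each attempt writes into its own uniquely named directory, committing one attempt and aborting another never touch the same files.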
The framework ensures that in the event of multiple task attempts for a particular task, only one will be committed; the others will be aborted. This situation may arise because the first attempt failed for some reason — in which case, it would be aborted, and a later, successful attempt would be committed. It can also occur if two task attempts were running concurrently as speculative duplicates; in this instance, the one that finished first would be committed, and the other would be aborted.
Task side-effect files
The usual way of writing output from map and reduce tasks is by using OutputCollector to collect key-value pairs. Some applications need more flexibility than a single key-value pair model, so these applications write output files directly from the map or re-