The setupJob() method is called before the job is run, and is typically used to perform initialization. For FileOutputCommitter, the method creates the final output directory, ${mapreduce.output.fileoutputformat.outputdir}, and a temporary working space for task output, _temporary, as a subdirectory underneath it.
If the job succeeds, the commitJob() method is called, which in the default file-based implementation deletes the temporary working space and creates a hidden empty marker file in the output directory called _SUCCESS to indicate to filesystem clients that the job completed successfully. If the job did not succeed, abortJob() is called with a state object indicating whether the job failed or was killed (by a user, for example). In the default implementation, this will delete the job's temporary working space.
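The job-level protocol can be sketched on the local filesystem with java.nio.file. This is an illustrative simulation, not the actual FileOutputCommitter code (which works against a Hadoop FileSystem); only the directory and marker-file names come from the text, and the class and method names here are hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative sketch of the job-level commit protocol described above.
class JobCommitSketch {
    private final Path outputDir; // stands in for ${mapreduce.output.fileoutputformat.outputdir}
    private final Path tempDir;   // the _temporary working space underneath it

    JobCommitSketch(Path outputDir) {
        this.outputDir = outputDir;
        this.tempDir = outputDir.resolve("_temporary");
    }

    // setupJob(): create the final output directory and the temporary
    // working space as a subdirectory underneath it.
    void setupJob() throws IOException {
        Files.createDirectories(tempDir);
    }

    // commitJob(): delete the temporary working space and drop the empty
    // _SUCCESS marker so filesystem clients can see the job finished cleanly.
    void commitJob() throws IOException {
        deleteRecursively(tempDir);
        Files.createFile(outputDir.resolve("_SUCCESS"));
    }

    // abortJob(): on failure or kill, just remove the temporary working space.
    void abortJob() throws IOException {
        deleteRecursively(tempDir);
    }

    private static void deleteRecursively(Path p) throws IOException {
        if (!Files.exists(p)) return;
        try (Stream<Path> walk = Files.walk(p)) {
            walk.sorted(Comparator.reverseOrder()).forEach(q -> {
                try {
                    Files.delete(q);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

The key design point mirrored here is that all intermediate state lives under one _temporary directory, so both commit and abort reduce to a single recursive delete.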
The operations are similar at the task level. The setupTask() method is called before
the task is run, and the default implementation doesn't do anything, because temporary
directories named for task outputs are created when the task outputs are written.
The commit phase for tasks is optional and may be disabled by returning false from needsTaskCommit(). This saves the framework from having to run the distributed commit protocol for the task, and neither commitTask() nor abortTask() is called. FileOutputCommitter will skip the commit phase when no output has been written by a task.
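The needsTaskCommit() decision can be sketched as a simple existence check: commit only when the attempt actually produced output. This is a minimal sketch of the rule the text attributes to FileOutputCommitter; the class name and the idea of passing the attempt directory as a Path are assumptions for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative needsTaskCommit(): skip the distributed commit protocol
// entirely when the task attempt wrote no output.
class TaskCommitCheck {
    // attemptDir stands in for the attempt's temporary output directory,
    // which only comes into existence once the task writes output.
    static boolean needsTaskCommit(Path attemptDir) {
        return Files.exists(attemptDir);
    }
}
```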
If a task succeeds, commitTask() is called, which in the default implementation moves the temporary task output directory (which has the task attempt ID in its name to avoid conflicts between task attempts) to the final output path, ${mapreduce.output.fileoutputformat.outputdir}. Otherwise, the framework calls abortTask(), which deletes the temporary task output directory.
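The task-level commit and abort can be sketched the same way, again on the local filesystem rather than via the real Hadoop API; the names below are hypothetical, and the sketch assumes the attempt directory holds a flat set of output files:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative commitTask()/abortTask(): the attempt directory is named
// after the task attempt ID so concurrent attempts cannot collide, and
// committing amounts to moving its contents into the final output path.
class TaskCommitSketch {
    // commitTask(): promote the attempt's files into the final output
    // directory, then remove the now-empty attempt directory.
    static void commitTask(Path attemptDir, Path outputDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.move(f, outputDir.resolve(f.getFileName()));
            }
        }
        Files.delete(attemptDir);
    }

    // abortTask(): discard the attempt's temporary output wholesale
    // (flat files only in this sketch).
    static void abortTask(Path attemptDir) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.delete(f);
            }
        }
        Files.delete(attemptDir);
    }
}
```

Because each attempt writes into its own uniquely named directory, committing one attempt and aborting another never touch the same files.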
The framework ensures that in the event of multiple task attempts for a particular task, only one will be committed; the others will be aborted. This situation may arise because the first attempt failed for some reason — in which case, it would be aborted, and a later, successful attempt would be committed. It can also occur if two task attempts were running concurrently as speculative duplicates; in this instance, the one that finished first would be committed, and the other would be aborted.
Task side-effect files
The usual way of writing output from map and reduce tasks is by using OutputCollector to collect key-value pairs. Some applications need more flexibility than a single key-value pair model, so these applications write output files directly from the map or re-