There is a good case for turning off speculative execution for reduce tasks, since any duplicate reduce tasks have to fetch the same map outputs as the original task, and this can significantly increase network traffic on the cluster.
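In the old MapReduce API this is a single boolean property; it could be set in a job configuration file as shown below (in the new API the equivalent property is mapreduce.reduce.speculative):

```
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```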
Another reason for turning off speculative execution is for nonidempotent tasks. However, in many cases it is possible to write tasks to be idempotent and use an OutputCommitter to promote the output to its final location when the task succeeds. This technique is explained in more detail in the next section.
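The promote-on-success idea can be sketched without Hadoop: each task attempt writes to a private temporary path, and the commit is an atomic rename, so a duplicate or retried attempt never exposes partial output. The names here (PromoteOnCommit, writeAttempt, commit) are illustrative, not part of the Hadoop API:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the "promote on success" pattern: an attempt writes to a
// private temporary path, and only an atomic rename publishes the output,
// so reruns of the same task are harmless.
public class PromoteOnCommit {
    static Path writeAttempt(Path workDir, String taskId, String data)
            throws IOException {
        // Each attempt writes to its own private location.
        Path attempt = workDir.resolve("_temporary-" + taskId);
        Files.writeString(attempt, data);
        return attempt;
    }

    static void commit(Path attempt, Path finalPath) throws IOException {
        // The atomic move is the commit: the final file either appears
        // complete or does not appear at all.
        Files.move(attempt, finalPath, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("job");
        Path attempt = writeAttempt(workDir, "attempt_0", "part-00000 contents");
        commit(attempt, workDir.resolve("part-00000"));
        System.out.println(Files.readString(workDir.resolve("part-00000")));
    }
}
```

Because the rename is atomic and the final contents are the same whichever attempt wins, rerunning the task (speculatively or after failure) is safe.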
Output Committers
Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed or fail cleanly. The behavior is implemented by the OutputCommitter in use for the job, which is set in the old MapReduce API by calling the setOutputCommitter() method on JobConf or by setting mapred.output.committer.class in the configuration. In the new MapReduce API, the OutputCommitter is determined by the OutputFormat, via its getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for file-based MapReduce. You can customize an existing OutputCommitter or even write a new implementation if you need to do special setup or cleanup for jobs or tasks.
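In the old API, the configuration-based route mentioned above might look like this in a job configuration file (the value here is simply the default, old-API FileOutputCommitter, shown for illustration; a custom committer would name your own class):

```
<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.FileOutputCommitter</value>
</property>
```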
The OutputCommitter API is as follows (in both the old and new MapReduce APIs):
public abstract class OutputCommitter {

  public abstract void setupJob(JobContext jobContext) throws IOException;

  public void commitJob(JobContext jobContext) throws IOException { }

  public void abortJob(JobContext jobContext, JobStatus.State state)
      throws IOException { }

  public abstract void setupTask(TaskAttemptContext taskContext)
      throws IOException;

  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
      throws IOException;

  public abstract void commitTask(TaskAttemptContext taskContext)
      throws IOException;

  public abstract void abortTask(TaskAttemptContext taskContext)
      throws IOException;
}
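To see how these methods fit together, here is a minimal, Hadoop-free sketch of the commit protocol driving a committer through one successful task attempt. The class and method names mirror the API above, but this is an illustrative analogue, not Hadoop code:

```java
import java.util.*;

// Hadoop-free sketch of the commit protocol: the framework calls setupJob
// once, then setupTask/needsTaskCommit/commitTask per task attempt, and
// finally commitJob once all tasks have succeeded.
public class CommitProtocolSketch {
    static abstract class Committer {
        abstract void setupJob();
        void commitJob() { }                       // optional override
        abstract void setupTask(String attemptId);
        abstract boolean needsTaskCommit(String attemptId);
        abstract void commitTask(String attemptId);
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Committer c = new Committer() {
            void setupJob()                 { calls.add("setupJob"); }
            @Override void commitJob()      { calls.add("commitJob"); }
            void setupTask(String id)       { calls.add("setupTask:" + id); }
            boolean needsTaskCommit(String id) { return true; } // attempt wrote output
            void commitTask(String id)      { calls.add("commitTask:" + id); }
        };

        c.setupJob();                        // once, before any tasks run
        c.setupTask("attempt_0");            // per task attempt
        if (c.needsTaskCommit("attempt_0"))  // skip commit if nothing was written
            c.commitTask("attempt_0");
        c.commitJob();                       // once, after all tasks succeed

        System.out.println(String.join(",", calls));
    }
}
```

On failure, the framework would call abortTask or abortJob instead of the corresponding commit methods, which is why the real API pairs each commit with an abort.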