main consideration is whether you have a linear chain of jobs or a more complex directed
acyclic graph (DAG) of jobs.
For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:

    JobClient.runJob(conf1);
    JobClient.runJob(conf2);
If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed. Depending on your application, you might want to catch the exception and clean up any intermediate data that was produced by any previous jobs.
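One way to do this cleanup is to catch the IOException, delete the intermediate output directory, and rethrow. The sketch below assumes an intermediate path is passed in by the caller; the class and method names are hypothetical, not part of Hadoop's API:

```java
// Old-API job chain with cleanup on failure. ChainWithCleanup and its
// run() method are illustrative names; JobClient, JobConf, FileSystem,
// and Path are the real Hadoop classes.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainWithCleanup {
  public static void run(JobConf conf1, JobConf conf2, Path intermediate)
      throws IOException {
    try {
      JobClient.runJob(conf1);  // throws IOException if the job fails
      JobClient.runJob(conf2);
    } catch (IOException e) {
      // Remove the first job's output so a rerun starts from a clean state
      FileSystem fs = intermediate.getFileSystem(conf1);
      fs.delete(intermediate, true);  // true = recursive delete
      throw e;
    }
  }
}
```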
The approach is similar with the new MapReduce API, except you need to examine the Boolean return value of the waitForCompletion() method on Job: true means the job succeeded, and false means it failed.
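With the new API, the same linear chain looks something like the following sketch. The driver class and job names are placeholders, and the job configuration (mapper, reducer, paths) is elided:

```java
// Chaining two jobs with the new MapReduce API: stop the pipeline
// as soon as waitForCompletion() returns false.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LinearChainDriver {
  public static void main(String[] args) throws Exception {
    Job job1 = Job.getInstance(new Configuration(), "first-job");
    // ... set mapper, reducer, input and output paths for job1 ...
    if (!job1.waitForCompletion(true)) {  // blocks until job1 finishes
      System.exit(1);                     // don't run later jobs on failure
    }

    Job job2 = Job.getInstance(new Configuration(), "second-job");
    // ... configure job2, typically reading job1's output ...
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}
```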
For anything more complex than a linear chain, there are libraries that can help orchestrate your workflow (although they are also suited to linear chains, or even one-off jobs). The simplest is the JobControl class, in the org.apache.hadoop.mapreduce.jobcontrol package. (There is an equivalent class in the org.apache.hadoop.mapred.jobcontrol package, too.) An instance of JobControl represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. You run the JobControl in a thread, and it runs the jobs in dependency order. You can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures. If a job fails, JobControl won't run the jobs that depend on it.
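A minimal JobControl driver might look like the following sketch. (In recent Hadoop releases the new-API classes live under org.apache.hadoop.mapreduce.lib.jobcontrol; the package name may differ across versions.) The job names are placeholders and the per-job configuration is elided:

```java
// Two dependent jobs run through JobControl: "second" is only
// submitted if "first" succeeds.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class DagDriver {
  public static void main(String[] args) throws Exception {
    // Wrap each configured Job in a ControlledJob
    ControlledJob first = new ControlledJob(Job.getInstance(), null);
    ControlledJob second = new ControlledJob(Job.getInstance(), null);
    second.addDependingJob(first);  // second waits for first to succeed

    JobControl control = new JobControl("example-dag");
    control.addJob(first);
    control.addJob(second);

    new Thread(control).start();    // JobControl implements Runnable
    while (!control.allFinished()) {
      Thread.sleep(1000);           // poll for progress
    }
    control.stop();

    if (!control.getFailedJobList().isEmpty()) {
      System.err.println("Failed jobs: " + control.getFailedJobList());
      System.exit(1);
    }
  }
}
```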
Apache Oozie
Apache Oozie is a system for running workflows of dependent jobs. It is composed of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs.
Oozie makes rerunning failed workflows more tractable, since no time is wasted running the successful parts of a workflow. Anyone who has managed a complex batch system knows how difficult it can be to catch up from jobs missed due to downtime or failure, and will appreciate this feature. (Furthermore, coordinator applications representing a single data pipeline may be packaged into a bundle and run together as a unit.)