main consideration is whether you have a linear chain of jobs or a more complex directed
acyclic graph (DAG) of jobs.
For a linear chain, the simplest approach is to run each job one after another, waiting until
a job completes successfully before running the next:
JobClient.runJob(conf1);
JobClient.runJob(conf2);
If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed. Depending on your application, you might want to catch the exception and clean up any intermediate data that was produced by previous jobs.

The approach is similar with the new MapReduce API, except that you need to examine the boolean return value of the waitForCompletion() method on Job: true means the job succeeded, and false means it failed.
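A minimal driver using the new API might look like the following sketch. The mapper, reducer, and path settings are elided, and the class and job names are illustrative, not taken from the text:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: run two jobs in sequence with the new MapReduce API,
// checking the boolean result of waitForCompletion() before
// proceeding to the next job in the pipeline.
public class SequentialDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job1 = Job.getInstance(conf, "first job");
    // ... set mapper, reducer, input and output paths for job1 ...
    if (!job1.waitForCompletion(true)) {
      // The first job failed, so don't run the rest of the pipeline;
      // this is also the place to clean up any intermediate data.
      System.exit(1);
    }

    Job job2 = Job.getInstance(conf, "second job");
    // ... set mapper, reducer, input and output paths for job2 ...
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}
```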
For anything more complex than a linear chain, there are libraries that can help orchestrate your workflow (although they are also suited to linear chains, or even one-off jobs). The simplest is in the org.apache.hadoop.mapreduce.jobcontrol package: the JobControl class. (There is an equivalent class in the org.apache.hadoop.mapred.jobcontrol package, too.) An instance of JobControl represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. You run the JobControl in a thread, and it runs the jobs in dependency order. You can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures. If a job fails, JobControl won't run the jobs that depend on it.
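The steps above can be sketched as follows. This assumes the new-API JobControl and ControlledJob classes (found under org.apache.hadoop.mapreduce.lib.jobcontrol in recent Hadoop releases); the job wiring itself is elided:

```java
import java.util.List;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Sketch: wire two jobs into a JobControl graph where job2 depends
// on job1, run the graph in a thread, and poll for completion.
public class WorkflowDriver {
  public static void runWorkflow(Job job1, Job job2) throws Exception {
    ControlledJob cj1 = new ControlledJob(job1, null);
    ControlledJob cj2 = new ControlledJob(job2, null);
    cj2.addDependingJob(cj1); // job2 runs only after job1 succeeds

    JobControl control = new JobControl("example-workflow");
    control.addJob(cj1);
    control.addJob(cj2);

    // JobControl implements Runnable, so run it in its own thread
    new Thread(control).start();
    while (!control.allFinished()) {
      Thread.sleep(1000); // poll for progress
    }

    // Query for failures once all jobs have finished
    List<ControlledJob> failed = control.getFailedJobList();
    if (!failed.isEmpty()) {
      System.err.println("Failed jobs: " + failed);
    }
    control.stop();
  }
}
```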
Apache Oozie
Apache Oozie is a system for running workflows of dependent jobs. It is composed of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs.

Oozie makes rerunning failed workflows more tractable, since no time is wasted running the successful parts of a workflow. Anyone who has managed a complex batch system knows how difficult it can be to catch up from jobs missed due to downtime or failure, and will appreciate this feature. (Furthermore, coordinator applications representing a single data pipeline may be packaged into a bundle and run together as a unit.)
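For a flavor of what a workflow looks like, here is a minimal workflow definition (a workflow.xml file) with a single MapReduce action; the action's job properties are placeholders, and the workflow name and node names are illustrative:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="example-wf">
  <start to="first-mr"/>
  <action name="first-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper, reducer, and path properties go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The workflow is a DAG of action nodes (which do the work) and control-flow nodes (start, kill, end) that route between them on success or failure.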