main consideration is whether you have a linear chain of jobs or a more complex directed
acyclic graph (DAG) of jobs.
For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:

    JobClient.runJob(conf1);
    JobClient.runJob(conf2);
If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed. Depending on your application, you might want to catch the exception and clean up any intermediate data that was produced by any previous jobs.
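One way to do this cleanup is to catch the IOException, delete the intermediate output directory, and rethrow. The sketch below assumes an intermediate path is passed in by the caller; the class and method names are hypothetical, not part of Hadoop's API:

```java
// Old-API job chain with cleanup on failure. ChainWithCleanup and its
// run() method are illustrative names; JobClient, JobConf, FileSystem,
// and Path are the real Hadoop classes.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainWithCleanup {
  public static void run(JobConf conf1, JobConf conf2, Path intermediate)
      throws IOException {
    try {
      JobClient.runJob(conf1);  // throws IOException if the job fails
      JobClient.runJob(conf2);
    } catch (IOException e) {
      // Remove the first job's output so a rerun starts from a clean state
      FileSystem fs = intermediate.getFileSystem(conf1);
      fs.delete(intermediate, true);  // true = recursive delete
      throw e;
    }
  }
}
```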
The approach is similar with the new MapReduce API, except you need to examine the Boolean return value of the waitForCompletion() method on Job: true means the job succeeded, and false means it failed.
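With the new API, the same linear chain looks something like the following sketch. The driver class and job names are placeholders, and the job configuration (mapper, reducer, paths) is elided:

```java
// Chaining two jobs with the new MapReduce API: stop the pipeline
// as soon as waitForCompletion() returns false.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LinearChainDriver {
  public static void main(String[] args) throws Exception {
    Job job1 = Job.getInstance(new Configuration(), "first-job");
    // ... set mapper, reducer, input and output paths for job1 ...
    if (!job1.waitForCompletion(true)) {  // blocks until job1 finishes
      System.exit(1);                     // don't run later jobs on failure
    }

    Job job2 = Job.getInstance(new Configuration(), "second-job");
    // ... configure job2, typically reading job1's output ...
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}
```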
For anything more complex than a linear chain, there are libraries that can help orchestrate your workflow (although they are also suited to linear chains, or even one-off jobs). The simplest is the JobControl class, in the org.apache.hadoop.mapreduce.jobcontrol package. (There is an equivalent class in the org.apache.hadoop.mapred.jobcontrol package, too.) An instance of JobControl represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. You run the JobControl in a thread, and it runs the jobs in dependency order. You can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures. If a job fails, JobControl won't run the jobs that depend on it.
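A minimal JobControl driver might look like the following sketch. (In recent Hadoop releases the new-API classes live under org.apache.hadoop.mapreduce.lib.jobcontrol; the package name may differ across versions.) The job names are placeholders and the per-job configuration is elided:

```java
// Two dependent jobs run through JobControl: "second" is only
// submitted if "first" succeeds.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class DagDriver {
  public static void main(String[] args) throws Exception {
    // Wrap each configured Job in a ControlledJob
    ControlledJob first = new ControlledJob(Job.getInstance(), null);
    ControlledJob second = new ControlledJob(Job.getInstance(), null);
    second.addDependingJob(first);  // second waits for first to succeed

    JobControl control = new JobControl("example-dag");
    control.addJob(first);
    control.addJob(second);

    new Thread(control).start();    // JobControl implements Runnable
    while (!control.allFinished()) {
      Thread.sleep(1000);           // poll for progress
    }
    control.stop();

    if (!control.getFailedJobList().isEmpty()) {
      System.err.println("Failed jobs: " + control.getFailedJobList());
      System.exit(1);
    }
  }
}
```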
Apache Oozie
Apache Oozie is a system for running workflows of dependent jobs. It is composed of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs.
Oozie makes rerunning failed workflows more tractable, since no time is wasted running the successful parts of a workflow. Anyone who has managed a complex batch system knows how difficult it can be to catch up from jobs missed due to downtime or failure, and will appreciate this feature. (Furthermore, coordinator applications representing a single data pipeline may be packaged into a bundle and run together as a unit.)