here), computes its output and stores it in HDFS, but some tasks will be more complicated.
For example, you may have a job that requires that two or three other jobs finish, and each of
these require that data is loaded into HDFS from some external source. And you may want to
run this job on a periodic basis. Of course, you could orchestrate this manually or by some
clever scripting, but there is an easier way.
That way is Oozie, Hadoop's workflow scheduler. It's a bit complicated at first, but it has some useful power to start, stop, suspend, and restart jobs, and to control the workflow so that no task within the complete job runs before the tasks and objects it requires are ready. Oozie puts its actions (jobs and tasks) in a directed acyclic graph (DAG) that describes which actions depend on previous actions completing successfully. This is defined in a large XML file (actually hPDL, the Hadoop Process Definition Language). The file is too large to display here for any nontrivial example, but the tutorials and the Oozie site have examples.
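To give a sense of the shape without a nontrivial example, here is a minimal, trivial sketch of a workflow definition (placeholder names throughout, assuming a recent workflow schema version): a single action plus the start, ok/error transitions, kill, and end nodes that every workflow carries.

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="my-action"/>
  <action name="my-action">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- mapper/reducer configuration would go here -->
    </map-reduce>
    <ok to="end"/>      <!-- where to go if the action succeeds -->
    <error to="fail"/>  <!-- where to go if it fails -->
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>

Real workflows simply add more action and control nodes to this skeleton, which is why they grow large quickly.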
What is a DAG? A graph is a collection of nodes and arcs. Nodes represent states or objects. Arcs connect the nodes. If an arc has an arrow at one end, then that arc is directed, and the arrow indicates the direction. In Oozie, the nodes are the actions, such as to run a job, fork, fail, or end. The arcs show which actions flow into others. The graph is directed to show the ordering of the actions and the decisions or controls, that is, which nodes must run jobs and whether events precede or follow (e.g., a file object must be present before a Pig script is run). Acyclic means that in traversing the graph, once you leave a node, you cannot get back there. That would be a cycle. An implication of this is that Oozie cannot be used to iterate through a set of nodes until a condition is met (i.e., there are no while loops). There is more information about graphs in “Giraph”.
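For example, the requirement that a file be present before a Pig script runs can be expressed as a decision node. The following is a hedged sketch using Oozie's fs:exists EL function; inputDir is a hypothetical workflow property, and the pig-job and fail nodes are assumed to be defined elsewhere in the workflow.

<decision name="input-ready">
  <switch>
    <!-- run the Pig action only if the marker file exists -->
    <case to="pig-job">${fs:exists(concat(inputDir, '/_SUCCESS'))}</case>
    <default to="fail"/>
  </switch>
</decision>

In practice, recurring data-availability requirements like this are often pushed up into an Oozie coordinator rather than checked inside the workflow itself.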
Figure 4-1 is a graphical example of an Oozie flow in which a Hive job requires the output of both a Pig job and a MapReduce job, both of which require external files to be present.
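A flow like that could be expressed roughly as the following fork/join fragment, placed inside the workflow-app skeleton shown earlier. This is a sketch, not a complete workflow: the script names and configuration are placeholders, the file-presence checks are omitted (they would be handled by decision nodes like the one above or by a coordinator), and the Hive action assumes the Hive action extension schema is available.

<fork name="prepare">
  <path start="pig-job"/>
  <path start="mr-job"/>
</fork>
<action name="pig-job">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>transform.pig</script>   <!-- placeholder script name -->
  </pig>
  <ok to="joined"/>
  <error to="fail"/>
</action>
<action name="mr-job">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- mapper/reducer configuration would go here -->
  </map-reduce>
  <ok to="joined"/>
  <error to="fail"/>
</action>
<join name="joined" to="hive-job"/>  <!-- wait for both branches -->
<action name="hive-job">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>report.hql</script>      <!-- placeholder script name -->
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>

The fork starts the Pig and MapReduce branches in parallel, the join waits until both complete successfully, and only then does the Hive action run, which is exactly the dependency ordering the DAG encodes.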