Apache Oozie
Big data processing is usually broken down into several jobs, and these jobs need to be
executed in a specific sequence to produce the desired output. Running them by hand is
tedious and error-prone. The coordination and scheduling of such jobs is called a workflow.
Apache Oozie is a workflow management system for Apache Hadoop. Different types of jobs,
such as MapReduce, Hive, Pig, and Sqoop jobs, or custom jobs such as Java programs, can be
scheduled and coordinated using Oozie.
An Oozie workflow consists of action nodes and control nodes. An action node executes a
specific process, for example, a MapReduce job. Control nodes direct the flow of execution,
for example, the start node, the end node, and the kill node, which ends the workflow when
an action fails.
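The relationship between action nodes and control nodes can be sketched in a few lines of Python. This is a toy model and not Oozie's implementation: the classes Action and Terminal, the function run_workflow, and the node names extract, load, end, and fail are all hypothetical, and the failure terminal stands in for what Oozie calls a kill node. Each action runs a task and then follows either its "ok" or its "error" transition until a terminal control node is reached.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Action:
    """Action node: runs a task, then follows the 'ok' or 'error' transition."""
    run: Callable[[], bool]  # hypothetical task; returns True on success
    ok: str                  # name of the node to visit on success
    error: str               # name of the node to visit on failure


@dataclass
class Terminal:
    """Control node that stops the workflow (models an end or kill node)."""
    status: str


Node = Union[Action, Terminal]


def run_workflow(nodes: Dict[str, Node], start: str) -> Tuple[str, List[str]]:
    """Walk the graph from the start transition until a terminal node."""
    path: List[str] = []
    current = start
    while True:
        path.append(current)
        node = nodes[current]
        if isinstance(node, Terminal):
            return node.status, path
        current = node.ok if node.run() else node.error


# Two actions in sequence; the second one is made to fail on purpose.
nodes: Dict[str, Node] = {
    "extract": Action(run=lambda: True, ok="load", error="fail"),
    "load": Action(run=lambda: False, ok="end", error="fail"),
    "end": Terminal("SUCCEEDED"),
    "fail": Terminal("KILLED"),
}

status, path = run_workflow(nodes, start="extract")
print(status, path)
```

Because the hypothetical load task reports failure, the walk leaves the normal path and finishes at the fail node instead of the end node.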
The configuration of Oozie workflows is done using Hadoop Process Definition Language
(hPDL). hPDL is an XML-based definition language.
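To give a feel for what an hPDL definition contains, the following Python sketch assembles a minimal workflow document with the standard library and prints it. It is an illustration under stated assumptions: the workflow name sample-wf, the node names mr-node, end, and fail, the schema version 0.5, and the ${jobTracker} and ${nameNode} placeholders are chosen for the example, and the configuration a real MapReduce action would need (mapper and reducer classes, input and output paths) is omitted.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; real deployments may use a different one.
OOZIE_NS = "uri:oozie:workflow:0.5"


def build_workflow(name: str) -> str:
    """Return a minimal hPDL document: start -> one action -> end or kill."""
    wf = ET.Element("workflow-app", {"xmlns": OOZIE_NS, "name": name})

    # Control node: where execution begins.
    ET.SubElement(wf, "start", {"to": "mr-node"})

    # Action node: a MapReduce job with its two outgoing transitions.
    action = ET.SubElement(wf, "action", {"name": "mr-node"})
    mr = ET.SubElement(action, "map-reduce")
    ET.SubElement(mr, "job-tracker").text = "${jobTracker}"
    ET.SubElement(mr, "name-node").text = "${nameNode}"
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    # Control nodes: failure and success terminals.
    kill = ET.SubElement(wf, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "MapReduce job failed"
    ET.SubElement(wf, "end", {"name": "end"})

    return ET.tostring(wf, encoding="unicode")


xml_text = build_workflow("sample-wf")
print(xml_text)
```

In practice such a definition is written by hand as a workflow.xml file; generating it from code is done here only to keep the example self-checking.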
The following diagram shows a sample Oozie workflow: