oozie.libpath=${nameNode}/user/hadoop/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true

hdfsUser=hadoop
wfProject=fuel
hdfsWfHome=${nameNode}/user/${hdfsUser}/oozie_wf/${wfProject}
hdfsRawData=${hdfsWfHome}/rawdata
hdfsEntityData=${hdfsWfHome}/entity

oozie.wf.application.path=${hdfsWfHome}/pigwf
oozieWfPath=${hdfsWfHome}/pigwf/
The parameters in this file specify the Hadoop name node by server and port. Because YARN is being employed, the Resource Manager is defined by its host and port through the jobTracker variable; JobTracker is a Hadoop V1 component name, but Oozie retains the property under YARN, where it simply points at the Resource Manager. The queue name to be used for this workflow, high_pool, is also specified.
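The nameNode, jobTracker, and queueName definitions sit in the opening lines of the properties file, before the excerpt above. As a minimal sketch, they might look like the following, with the hc1nn host taken from the session prompt and the default NameNode and Resource Manager ports assumed:

# host and port values below are assumptions for illustration
nameNode=hdfs://hc1nn:8020
jobTracker=hc1nn:8032
queueName=high_pool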
The location of the Oozie shared library is defined by oozie.libpath, and setting oozie.use.system.libpath to true makes that library available to the job; oozie.wf.rerun.failnodes=true means that a rerun of a failed workflow restarts from its failed nodes rather than from the beginning. The HDFS user for the job is specified, as is a project name. Finally, the paths are defined for the raw data, the workflow scripts, and the entity data that will be produced. The special variable oozie.wf.application.path is used to define the location of the workflow job file.
The workflow.xml file is the main control file for the workflow job. It controls the flow of actions, via Oozie, and manages the subtasks. This workflow runs two parallel streams of processing over the data in the HDFS rawdata directory.
The manufacturer.pig script is called to strip the manufacturer-based data from the HDFS rawdata files; this data is placed in the HDFS entity/manufacturer directory. Then the manufacturer.sql script is called to load that data into the Hive data warehouse.
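The manufacturer.pig script itself is not listed in this section. As a minimal sketch (the delimiter, column position, and literal paths are assumptions for illustration), it might look something like this:

-- load the comma-delimited raw fuel data; no schema is declared
rawdata = LOAD '/user/hadoop/oozie_wf/fuel/rawdata' USING PigStorage(',');

-- keep only the assumed manufacturer column
mfr_col = FOREACH rawdata GENERATE $0 AS manufacturer;

-- remove duplicate manufacturer names
mfr = DISTINCT mfr_col;

-- write the entity data where the Hive load step expects it
STORE mfr INTO '/user/hadoop/oozie_wf/fuel/entity/manufacturer' USING PigStorage(',');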
In parallel to this (via a fork element in the XML), the model.pig script is called to strip the vehicle model-based data from the HDFS rawdata files; this data is placed in the HDFS entity/model directory. Then the model.sql script is called to load that data into the Hive data warehouse.
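The .sql scripts are likewise not listed here. A minimal HiveQL sketch of how manufacturer.sql might move the Pig output into the Hive data warehouse follows; the table name and single-column schema are assumptions:

-- create the target Hive table on the first run (name and schema are assumed)
CREATE TABLE IF NOT EXISTS manufacturer (manufacturer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- move the Pig output from the entity directory into the Hive warehouse
LOAD DATA INPATH '/user/hadoop/oozie_wf/fuel/entity/manufacturer' INTO TABLE manufacturer;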
The workflow.xml workflow file has been built using a combination of the workflow elements described earlier
(see “The Mechanics of the Oozie Workflow”). I have used the Hadoop file system cat command to display its contents:
[hadoop@hc1nn fuel]$ hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/workflow.xml
<workflow-app name="FuelWorkFlow" xmlns="uri:oozie:workflow:0.1">

  <start to="pig-fork"/>

  <fork name="pig-fork">
    <path start="pig-manufacturer"/>
    <path start="pig-model"/>
  </fork>

  <action name="pig-manufacturer">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${hdfsEntityData}/manufacturer"/>
      </prepare>
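The listing continues beyond this excerpt. As a sketch, a pig action of this kind would typically conclude by naming its script and its transitions; the ok and error targets below are hypothetical:

      <script>manufacturer.pig</script>
    </pig>
    <ok to="hive-manufacturer"/>
    <error to="fail"/>
  </action>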
 