Creating an Oozie Workflow
In this example, I examine and run a Pig- and Hive-based Oozie workflow. The example uses a Canadian
vehicle fuel-consumption data set that is provided at the website data.gc.ca. You can either
search for "Fuel Consumption Ratings" to find the data set or use the link http://open.canada.ca/data/en/
dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64.
To begin, I download the English version of each CSV file. I have downloaded these files using the
Linux hadoop account, placing them in that account's Downloads directory, as the Linux ls command shows:
[hadoop@hc1nn Downloads]$ ls
MY1995-1999 Fuel Consumption Ratings.csv MY2007 Fuel Consumption Ratings.csv
MY2000 Fuel Consumption Ratings.csv MY2008 Fuel Consumption Ratings.csv
MY2001 Fuel Consumption Ratings.csv MY2009 Fuel Consumption Ratings.csv
MY2002 Fuel Consumption Ratings.csv MY2010 Fuel Consumption Ratings.csv
MY2003 Fuel Consumption Ratings.csv MY2011 Fuel Consumption Ratings.csv
MY2004 Fuel Consumption Ratings.csv MY2012 Fuel Consumption Ratings.csv
MY2005 Fuel Consumption Ratings.csv MY2013 Fuel Consumption Ratings.csv
MY2006 Fuel Consumption Ratings.csv MY2014 Fuel Consumption Ratings.csv
I then need to copy these files to an HDFS directory so that they can be used by an Oozie workflow job. To do this,
I create some HDFS directories, as follows:
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf/fuel
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf/fuel/rawdata
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf/fuel/pigwf
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf/fuel/entity
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf/fuel/entity/manufacturer
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir /user/hadoop/oozie_wf/fuel/entity/model
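As an aside, this sequence can be shortened. Recent Hadoop releases accept a -p option to mkdir that creates
any missing parent directories along the path, so only the leaf directories need to be named. A minimal
sketch, assuming your hdfs version supports -p, follows:
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir -p /user/hadoop/oozie_wf/fuel/rawdata
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir -p /user/hadoop/oozie_wf/fuel/pigwf
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir -p /user/hadoop/oozie_wf/fuel/entity/manufacturer
[hadoop@hc1nn Downloads]$ hdfs dfs -mkdir -p /user/hadoop/oozie_wf/fuel/entity/model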
The Hadoop file system ls command lists the three subdirectories I've just created directly under fuel
and that will be used in this example (the manufacturer and model directories sit one level further down,
under entity, and so do not appear at this level).
[hadoop@hc1nn Downloads]$ hdfs dfs -ls /user/hadoop/oozie_wf/fuel/
Found 3 items
drwxr-xr-x - hadoop hadoop 0 2014-07-12 18:16 /user/hadoop/oozie_wf/fuel/entity
drwxr-xr-x - hadoop hadoop 0 2014-07-12 18:15 /user/hadoop/oozie_wf/fuel/pigwf
drwxr-xr-x - hadoop hadoop 0 2014-07-08 18:16 /user/hadoop/oozie_wf/fuel/rawdata
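To verify the manufacturer and model subdirectories under entity as well, you can pass the -R option to ls
for a recursive listing of the whole tree (output omitted here):
[hadoop@hc1nn Downloads]$ hdfs dfs -ls -R /user/hadoop/oozie_wf/fuel/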
The rawdata directory under /user/hadoop/oozie_wf/fuel/ on HDFS will hold the CSV data that I will use,
the pigwf directory will hold the scripts for the task, and the entity directory and its subdirectories
will hold the data used by this task.
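With the directories in place, the raw CSV files can be copied from the Linux file system into the rawdata
directory. The following is a minimal sketch of that copy step, run from the Downloads directory; the
unquoted glob is safe here because the shell expands each matching file name, spaces included, as a single
argument to hdfs:
[hadoop@hc1nn Downloads]$ hdfs dfs -put MY*.csv /user/hadoop/oozie_wf/fuel/rawdata
[hadoop@hc1nn Downloads]$ hdfs dfs -ls /user/hadoop/oozie_wf/fuel/rawdata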
 