The SQL script simply creates an external Hive table called rawdata2 over the manufacturer HDFS-based files. It then
creates a second table in Hive called "manufacturer" by selecting the contents of the rawdata2 table.
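The manufacturer.sql file itself is not reproduced here, but given the description above and the model.sql script shown later in this section, it would look something like the following sketch (the entity/manufacturer location path is an assumption):
-- external table over the Pig output for manufacturers (path assumed)
drop table if exists rawdata2 ;
create external table rawdata2 (
    line string
)
location '/user/hadoop/oozie_wf/fuel/entity/manufacturer/' ;
-- select the distinct manufacturer rows into a managed Hive table
drop table if exists manufacturer ;
create table manufacturer as
    select distinct split(line,',')
    from rawdata2 ;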
The model.pig and model.sql files are very similar, pulling vehicle model data from the HDFS-based raw data files and
loading it into Hive. I use the Hadoop file system cat command to display the model.pig file:
[hadoop@hc1nn fuel]$ hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/model.pig
-- get the raw data from the csv files

rlines = LOAD '/user/hadoop/oozie_wf/fuel/rawdata/*.csv' USING PigStorage(',') AS
    ( year:int, manufacturer:chararray, model:chararray, class:chararray, size:float, cylinders:int,
      transmission:chararray, fuel:chararray, cons_cityl100:float, cond_hwyl100:float, cons_citympgs:int,
      cond_hwympgs:int, lyears:int, co2s:int );

mlist = FOREACH rlines GENERATE manufacturer, year, model ;

dlist = DISTINCT mlist ;

STORE dlist INTO '/user/hadoop/oozie_wf/fuel/entity/model/' USING PigStorage(',');
Again, the Pig script strips vehicle model information from the HDFS-based CSV files in the rawdata directory and
stores that information in the entity/model HDFS directory. The model.sql script then loads that information into a
Hive table:
[hadoop@hc1nn fuel]$ hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/model.sql
drop table if exists rawdata2 ;

create external table rawdata2 (
    line string
)
location '/user/hadoop/oozie_wf/fuel/entity/model/' ;

drop table if exists model ;

create table model as
    select distinct split(line,',')
    from rawdata2
    where line not like '%=%' ;
The HiveQL script creates an external table called rawdata2 over the HDFS-based entity/model data; it then
selects that data into a Hive-based table called "model." Note that Hive's split function returns an array of strings,
so the model table holds a single array-typed column.
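Once the workflow has run, the result can be inspected from the Hive shell; for example (a hypothetical check, not part of the workflow itself):

hive> show tables ;
hive> select * from model limit 10 ;

The first statement confirms that the model table exists, while the second displays a sample of its rows.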
The intention of this workflow example is to show that complex ETL (extract, transform, load) chains of subtasks
can be built using Oozie. Tasks can be run in parallel, and control nodes can be added to the workflow to set up the
jobs and define the end conditions, as sketched below. Having described the workflow, it is now time for me to run
the job; the next section explains how the workflow can be run and monitored with the Oozie web-based user interface.
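The full workflow.xml file for this example is not repeated here, but the fork/join pattern it relies on is worth seeing in outline. The following is only a minimal sketch, not the chapter's actual workflow: it runs the manufacturer and model Pig actions in parallel and joins before ending, and all node names, plus the ${jobTracker} and ${nameNode} properties, are illustrative assumptions:

<workflow-app name="fuel_wf" xmlns="uri:oozie:workflow:0.4">

    <start to="fork-pig"/>

    <!-- run the manufacturer and model Pig steps in parallel -->
    <fork name="fork-pig">
        <path start="pig-manufacturer"/>
        <path start="pig-model"/>
    </fork>

    <action name="pig-manufacturer">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>manufacturer.pig</script>
        </pig>
        <ok to="join-pig"/>
        <error to="fail"/>
    </action>

    <action name="pig-model">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>model.pig</script>
        </pig>
        <ok to="join-pig"/>
        <error to="fail"/>
    </action>

    <!-- the join waits for all forked paths to complete -->
    <join name="join-pig" to="end"/>

    <kill name="fail">
        <message>Pig step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>

</workflow-app>

The fork node starts both action paths at once, the join node waits for every path to finish, and any action failure routes to the kill node, which defines the workflow's error end condition.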