The SQL script simply creates an external Hive table called rawdata2 over the manufacturer HDFS-based files. It then
creates a second table in Hive called "manufacturer" by selecting the contents of the rawdata2 table.
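The manufacturer.sql file itself is not reproduced here, but given the description above and the model.sql script shown later in this section, it would look something like the following sketch (the entity/manufacturer location path is an assumption):
-- external table over the Pig output for manufacturers (path assumed)
drop table if exists rawdata2 ;
create external table rawdata2 (
    line string
)
location '/user/hadoop/oozie_wf/fuel/entity/manufacturer/' ;
-- select the distinct manufacturer rows into a managed Hive table
drop table if exists manufacturer ;
create table manufacturer as
    select distinct split(line,',')
    from rawdata2 ;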
The model.pig and model.sql files are very similar, pulling vehicle model data from the HDFS-based raw data files and
loading it into Hive. I use the Hadoop file system cat command to display the model.pig file:
[hadoop@hc1nn fuel]$ hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/model.pig
-- get the raw data from the csv files

rlines = LOAD '/user/hadoop/oozie_wf/fuel/rawdata/*.csv' USING PigStorage(',') AS
    ( year:int, manufacturer:chararray, model:chararray, class:chararray, size:float, cylinders:int,
      transmission:chararray, fuel:chararray, cons_cityl100:float, cond_hwyl100:float, cons_citympgs:int,
      cond_hwympgs:int, lyears:int, co2s:int );

mlist = FOREACH rlines GENERATE manufacturer, year, model ;

dlist = DISTINCT mlist ;

STORE dlist INTO '/user/hadoop/oozie_wf/fuel/entity/model/' USING PigStorage(',');
Again, the Pig script strips vehicle model information from the HDFS-based CSV files in the rawdata directory and
stores that information in the entity/model HDFS directory. The model.sql script then loads that information into a
Hive table:
[hadoop@hc1nn fuel]$ hdfs dfs -cat /user/hadoop/oozie_wf/fuel/pigwf/model.sql
drop table if exists rawdata2 ;

create external table rawdata2 (
    line string
)
location '/user/hadoop/oozie_wf/fuel/entity/model/' ;

drop table if exists model ;

create table model as
    select distinct split(line,',')
    from rawdata2
    where line not like '%=%' ;
The HiveQL script creates an external table called rawdata2 over the HDFS-based entity/model data; it then
selects that data into a Hive-based table called "model." Note that Hive's split function returns an array of strings,
so the model table holds a single array-typed column.
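Once the workflow has run, the result can be inspected from the Hive shell; for example (a hypothetical check, not part of the workflow itself):

hive> show tables ;
hive> select * from model limit 10 ;

The first statement confirms that the model table exists, while the second displays a sample of its rows.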
The intention of this workflow example is to show that complex ETL (extract, transform, load) chains of subtasks
can be built using Oozie. Tasks can be run in parallel, and control nodes can be added to the workflow to set up the
jobs and define the end conditions, as sketched below. Having described the workflow, it is now time for me to run
the job; the next section explains how the workflow can be run and monitored with the Oozie web-based user interface.
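The full workflow.xml file for this example is not repeated here, but the fork/join pattern it relies on is worth seeing in outline. The following is only a minimal sketch, not the chapter's actual workflow: it runs the manufacturer and model Pig actions in parallel and joins before ending, and all node names, plus the ${jobTracker} and ${nameNode} properties, are illustrative assumptions:

<workflow-app name="fuel_wf" xmlns="uri:oozie:workflow:0.4">

    <start to="fork-pig"/>

    <!-- run the manufacturer and model Pig steps in parallel -->
    <fork name="fork-pig">
        <path start="pig-manufacturer"/>
        <path start="pig-model"/>
    </fork>

    <action name="pig-manufacturer">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>manufacturer.pig</script>
        </pig>
        <ok to="join-pig"/>
        <error to="fail"/>
    </action>

    <action name="pig-model">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>model.pig</script>
        </pig>
        <ok to="join-pig"/>
        <error to="fail"/>
    </action>

    <!-- the join waits for all forked paths to complete -->
    <join name="join-pig" to="end"/>

    <kill name="fail">
        <message>Pig step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>

</workflow-app>

The fork node starts both action paths at once, the join node waits for every path to finish, and any action failure routes to the kill node, which defines the workflow's error end condition.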