Figure 10-5. Pentaho Explorer's Design view
Creating ETL
Now that you have a sense of the PDI interface, it's time to examine an example MapReduce task to see how PDI
functions. I build the ETL example by creating the mapper and reducer transformations first, then the MapReduce
job itself. By following my steps, you'll learn how each module is configured and pick up some tips on how
to avoid pitfalls.
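Before building the transformations in PDI, it may help to see the mapper/reducer pattern they express, sketched in plain Python rather than PDI modules. This illustrative example counts records per vehicle manufacturer; the assumption that the make sits in the second CSV field is based on the sample data shown later in this section.

```python
# A conceptual mapper/reducer pair (plain Python, not PDI) counting
# vehicle records per manufacturer -- the same key/value pattern a
# Pentaho MapReduce job expresses with transformations.
from collections import defaultdict

def mapper(line):
    """Emit a (manufacturer, 1) pair for each CSV record; field 1 is
    assumed to hold the make, as in the sample data."""
    fields = line.split(",")
    yield fields[1], 1

def reducer(pairs):
    """Sum the emitted counts for each manufacturer key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = [
    "1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202",
    "1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,M5,X,9.6,7,29,40,1680,193",
]
counts = reducer(pair for line in lines for pair in mapper(line))
print(counts)  # {'ACURA': 2}
```

In a real PDI job the mapper and reducer are each a transformation, but the key/value flow is the same.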
To create my PDI Map Reduce example, I first need some data. The HDFS file (rawdata.txt) should look familiar—
parts of it were used in earlier chapters. Here, I use fuel consumption details for various vehicle models over a number
of years. The data file is CSV-based and resides under HDFS at /data/pentaho/rdbms/. I use the Hadoop file system
cat command to dump the file contents and the Linux head command to limit the data output:
[hadoop@hc2nn ~]$ hdfs dfs -cat /data/pentaho/rdbms/rawdata.txt | head -5
1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202
1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,M5,X,9.6,7,29,40,1680,193
1995,ACURA,INTEGRA GS-R,SUBCOMPACT,1.8,4,M5,Z,9.4,7,30,40,1660,191
1995,ACURA,LEGEND,COMPACT,3.2,6,A4,Z,12.6,8.9,22,32,2180,251
1995,ACURA,LEGEND COUPE,COMPACT,3.2,6,A4,Z,13,9.3,22,30,2260,260
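Each record holds fourteen comma-separated fields. The file carries no header row, so the column names in this minimal parsing sketch are assumptions inferred from the visible values (model year, make, model, vehicle class, engine size, cylinders, transmission, fuel type, consumption and fuel-cost figures); only the first few are certain from the text.

```python
# Parse one line of rawdata.txt into a dict of named fields.
# Column names are assumptions inferred from the sample values;
# the file itself has no header row.
import csv
from io import StringIO

COLUMNS = ["year", "make", "model", "vehicle_class", "engine_l",
           "cylinders", "transmission", "fuel_type",
           "metric_1", "metric_2", "metric_3", "metric_4",
           "metric_5", "metric_6"]

def parse_record(line):
    """Split one CSV record and pair each value with a column name."""
    values = next(csv.reader(StringIO(line)))
    return dict(zip(COLUMNS, values))

sample = "1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202"
record = parse_record(sample)
print(record["year"], record["make"], record["model"])
# 1995 ACURA INTEGRA
```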