ETL with Hadoop - Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset

The top-left side of the interface in Figure 10-26 shows the local repository for the project bd1; from there, I can
double-click the tmr1 job to open it. At the bottom of the interface is a section with Designer and Code tabs. The
Code tab lets me examine the Java code that Talend generates from the job file; the Designer tab lets me configure
each step of the job by selecting it and then run the job once the configuration is complete.

Before I proceed to use the Open Studio interface, I take a moment to consider the test data that this example
job will use. I have stored two CSV-based data files in the HDFS directory /data/talend/rdbms/, as the
following Hadoop file system ls command shows:

[hadoop@hc2nn ~]$ hdfs dfs -ls /data/talend/rdbms
Found 2 items
-rw-r--r-- 3 hadoop supergroup 1381638 2014-10-10 16:36 /data/talend/rdbms/rawdata.txt
-rw-r--r-- 3 hadoop supergroup 4389 2014-10-18 08:17 /data/talend/rdbms/rawprices.txt

The first file, rawdata.txt, contains the vehicle model fuel consumption data used in previous chapter
examples, while the second file, rawprices.txt, contains the matching model prices. Piping the output of the
Hadoop file system cat command into the Linux head command lists the first 10 rows of each file, as follows:

[hadoop@hc2nn ~]$ hdfs dfs -cat /data/talend/rdbms/rawdata.txt | head -10
1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202
1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,M5,X,9.6,7,29,40,1680,193
1995,ACURA,INTEGRA GS-R,SUBCOMPACT,1.8,4,M5,Z,9.4,7,30,40,1660,191
1995,ACURA,LEGEND,COMPACT,3.2,6,A4,Z,12.6,8.9,22,32,2180,251
1995,ACURA,LEGEND COUPE,COMPACT,3.2,6,A4,Z,13,9.3,22,30,2260,260
1995,ACURA,LEGEND COUPE,COMPACT,3.2,6,M6,Z,13.4,8.4,21,34,2240,258
1995,ACURA,NSX,TWO-SEATER,3,6,A4,Z,13.5,9.2,21,31,2320,267
1995,ACURA,NSX,TWO-SEATER,3,6,M5,Z,12.9,9,22,31,2220,255
1995,ALFA ROMEO,164 LS,COMPACT,3,6,A4,Z,15.7,10,18,28,2620,301
1995,ALFA ROMEO,164 LS,COMPACT,3,6,M5,Z,13.8,9,20,31,2320,267

[hadoop@hc2nn ~]$ hdfs dfs -cat /data/talend/rdbms/rawprices.txt | head -10
ACURA,INTEGRA,44284
ACURA,INTEGRA GS-R,44284
ACURA,LEGEND,44284
ACURA,LEGEND COUPE,44284
ACURA,NSX,32835
ACURA,2.5TL,44284
ACURA,3.2TL,44284

For my example, I plan to use only columns 2 and 3 from the first file, which contain the manufacturer and model

details, and the price information from the second file. (Note that these prices are test data, not real prices.)
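
To make the shape of this data concrete, here is a minimal Python sketch of how columns 2 and 3 of rawdata.txt can be matched against the rows of rawprices.txt. It is not the Talend job itself and not the code that Talend generates; the function names, the in-memory sample strings, and the choice of an inner join (rows without a price are dropped) are my own assumptions for illustration. The sample rows are copied from the listings above.

```python
import csv
import io

# A few rows copied from the listings above. A real job would read the
# files from HDFS; in-memory strings are used here only for illustration.
RAW_DATA = """\
1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202
1995,ACURA,INTEGRA GS-R,SUBCOMPACT,1.8,4,M5,Z,9.4,7,30,40,1660,191
1995,ACURA,NSX,TWO-SEATER,3,6,A4,Z,13.5,9.2,21,31,2320,267
1995,ALFA ROMEO,164 LS,COMPACT,3,6,A4,Z,15.7,10,18,28,2620,301
"""

RAW_PRICES = """\
ACURA,INTEGRA,44284
ACURA,INTEGRA GS-R,44284
ACURA,NSX,32835
"""


def load_prices(text):
    """Build a (manufacturer, model) -> price lookup from rawprices-style CSV."""
    prices = {}
    for make, model, price in csv.reader(io.StringIO(text)):
        prices[(make, model)] = int(price)
    return prices


def join_models_to_prices(data_text, price_text):
    """Match columns 2 and 3 of each raw data row against the price lookup.

    Assumption: an inner join, so rows with no matching price are skipped.
    """
    prices = load_prices(price_text)
    joined = []
    for row in csv.reader(io.StringIO(data_text)):
        key = (row[1], row[2])  # columns 2 and 3 (1-based): manufacturer, model
        if key in prices:
            joined.append((key[0], key[1], prices[key]))
    return joined


RESULT = join_models_to_prices(RAW_DATA, RAW_PRICES)
for make, model, price in RESULT:
    print(make, model, price)
```

In this sample, the ALFA ROMEO row is dropped because the price rows I copied contain no entry for it; the sketch emits one output row per matching raw data row and makes no attempt to remove duplicates.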
