Creating the ETL
Examining a completed job is a good way to understand Open Studio's modules and workflow. Figure 10-27 shows
job tmr1, the Pig-based Map Reduce job I created for this example. It starts with a connection to HDFS called
tHDFSConnection_1; if the connection succeeds, control passes to an HDFS delete step called tHDFSDelete_1,
which clears the results directory for the job. If that succeeds, control passes to a tPigLoad step called tPigLoad_1,
which loads the rawdata.txt file from HDFS. At the same time, another load step, called tPigLoad_2, loads the
rawprices.txt file from HDFS. The data from these files is then passed to a module called tPigMap_1, which
combines the data.
Figure 10-27. Work flow for Pig native Map Reduce job
Figure 10-27 shows that the tPigMap_1 step combines the data from the two files, while the tPigFilterColumns_1
step removes the column data that is not of interest. The tPigSort_1 step sorts the data, and the tPigStoreResult_1 step
then saves the sorted data to HDFS.
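Behind the graphical components, this job assembles the equivalent of a short Pig Latin script. The following sketch shows roughly what that script would look like; only the two input file names and the role of each component come from the job described above, while the HDFS paths, the comma delimiter, the column schemas, and the join key are assumptions made purely for illustration.

-- tHDFSDelete_1: clear the results directory (rmf does not fail if the path is missing)
rmf /data/results

-- tPigLoad_1 and tPigLoad_2: load the two input files (schemas are assumed)
rawdata   = LOAD '/data/rawdata.txt'   USING PigStorage(',') AS (symbol:chararray, volume:long);
rawprices = LOAD '/data/rawprices.txt' USING PigStorage(',') AS (symbol:chararray, price:double);

-- tPigMap_1: combine the two data sets on an assumed shared key
combined = JOIN rawdata BY symbol, rawprices BY symbol;

-- tPigFilterColumns_1: keep only the columns of interest
trimmed = FOREACH combined GENERATE rawdata::symbol AS symbol, volume, price;

-- tPigSort_1: sort the data
sorted = ORDER trimmed BY symbol;

-- tPigStoreResult_1: save the sorted data back to HDFS
STORE sorted INTO '/data/results' USING PigStorage(',');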
To create your own jobs, you select modules in the palette and drag them to the center jobs pane. Then, you
right-click the icons to connect the modules via conditional arrows or arrows that represent data flows (as was
done with Pentaho earlier). Figure 10-28 shows the creation of a conditional flow between steps of a job. If the
tHDFSConnection_1 step works, then control will pass to the deletion step. Figure 10-29 illustrates the creation of a
specific data flow between the tPigLoad_1 and tPigMap_1 steps.
Figure 10-28. Conditional job flow
 