Next, I click the Job Setup tab to specify the input and output paths for the job data, as shown in Figure 10-19. The input data file is stored on HDFS at /data/pentaho/rdbms, as explained earlier.
Figure 10-19. Job Setup tab for job pmr1
The input and output data formats for this job are defined as Hadoop MapReduce-based Java classes, such as org.apache.hadoop.mapred.TextOutputFormat. The Clean option is selected so that the job can be rerun: each time the job runs, it first clears out the results directory.
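The settings on the Job Setup tab correspond to ordinary Hadoop job properties. The fragment below is a rough sketch, using the legacy mapred-API property names that match the org.apache.hadoop.mapred format classes; the input format class and the output path shown here are illustrative assumptions, not values taken from the Pentaho dialog.

```xml
<!-- Approximate job configuration equivalent to the Job Setup tab
     (legacy mapred API property names; illustrative sketch only). -->
<configuration>
  <property>
    <name>mapred.input.dir</name>
    <value>/data/pentaho/rdbms</value>           <!-- input path on HDFS -->
  </property>
  <property>
    <name>mapred.output.dir</name>
    <value>/data/pentaho/rdbms/result</value>    <!-- hypothetical output path -->
  </property>
  <property>
    <name>mapred.input.format.class</name>
    <!-- assumed input format; the text names only the output format -->
    <value>org.apache.hadoop.mapred.TextInputFormat</value>
  </property>
  <property>
    <name>mapred.output.format.class</name>
    <value>org.apache.hadoop.mapred.TextOutputFormat</value>
  </property>
</configuration>
```

The Clean option has no counterpart among these properties; Pentaho simply deletes the output directory before each run, which is what allows the job to be resubmitted without a "directory already exists" failure.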
Lastly, I define the connection to the Hadoop cluster using the Cluster tab. As you can see in Figure 10-20, the only fields that I have changed on this tab are the hostnames and ports, so that Pentaho knows which host to connect to (hc2nn) for HDFS and MapReduce. I have also specified the ports: 8020 for HDFS and 8032 for the Resource Manager (which is actually labeled as the Job Tracker, but this is a CDH5 cluster using YARN).
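These are the same endpoints any Hadoop client would use. On the cluster side they correspond to standard CDH5/YARN configuration properties along the following lines (hc2nn is the host from this example; file placement may vary with your distribution):

```xml
<!-- core-site.xml: HDFS endpoint (port 8020) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hc2nn:8020</value>
</property>

<!-- yarn-site.xml: Resource Manager endpoint (port 8032),
     which Pentaho's Cluster tab labels "Job Tracker" -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hc2nn:8032</value>
</property>
```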