Figure 10-2. Pentaho Data Integration, showing the big data plug-in structure
The subdirectories shown in the hadoop-configurations directory indicate which Hadoop configurations
are supported by the pentaho-big-data-plugin. You select the active configuration by changing the following line
in the file plugin.properties:
active.hadoop.configuration=cdh50
For example, the setting shown in Figure 10-2 indicates that I have configured the pentaho-big-data-plugin for PDI to use
Cloudera's CDH5 (cdh50). Because I have limited memory available on my CDH5 cluster, I run PDI on a
Windows machine and access Hadoop on Linux remotely.
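To confirm which configurations your own copy of PDI ships with, you can list the hadoop-configurations directory and check the current setting before editing plugin.properties. The cmd.exe session below is only a sketch; the install path and the subdirectory names other than cdh50 are assumptions and will vary with your PDI version:

C:\>cd data-integration\plugins\pentaho-big-data-plugin
C:\data-integration\plugins\pentaho-big-data-plugin>dir /b hadoop-configurations
cdh50
hdp21
mapr31
C:\data-integration\plugins\pentaho-big-data-plugin>findstr "active.hadoop.configuration" plugin.properties
active.hadoop.configuration=cdh50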
You also need to copy the Hadoop configuration files to the PDI plug-in's Hadoop configuration directory. From the Cloudera
CDH5 Manager home page, you select the YARN (MR2 Included) option, then select the Actions drop-down menu,
followed by Download Client Configuration. The zipped file that is downloaded contains the files core-site.xml,
hadoop-env.sh, hdfs-site.xml, hive-site.xml, mapred-site.xml, and yarn-site.xml. Because I am using the CDH5
(cdh50) configuration, I copy these files to the following PDI directory: data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh50.
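As a concrete example of that copy step, the following cmd.exe session copies the extracted client configuration files into the active cdh50 directory. The download folder name (yarn-clientconfig) and the PDI install path are assumptions; substitute the locations on your own machine:

C:\>cd C:\Users\mikejf12\Downloads\yarn-clientconfig
C:\Users\mikejf12\Downloads\yarn-clientconfig>copy *.xml "C:\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh50"
C:\Users\mikejf12\Downloads\yarn-clientconfig>copy hadoop-env.sh "C:\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh50"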
Pentaho requires Sun/Oracle Java 1.7, which is available at https://java.com/en/download/index.jsp. Be sure
to download and install this on Windows; the following cmd.exe session output shows my Java installation:
C:\Users\mikejf12>java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
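If the PDI launch scripts pick up a different Java installation than the one you just installed, you can point them at the 1.7 runtime explicitly before starting Spoon. This sketch assumes a standard Oracle install path and that Spoon.bat honors the PENTAHO_JAVA_HOME variable (check set-pentaho-env.bat in your release); adjust the paths to suit your machine:

C:\Users\mikejf12>set PENTAHO_JAVA_HOME=C:\Program Files\Java\jre7
C:\Users\mikejf12>cd \data-integration
C:\data-integration>Spoon.bat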
 