Figure 10-2. Pentaho Data Integration, showing the big data plug-in structure
The subdirectories shown in the hadoop-configurations directory indicate which Hadoop configurations
are supported by the pentaho-big-data-plugin. You select the active configuration by changing the following line
in the file plugin.properties:
active.hadoop.configuration=cdh50
For example, the setting shown in Figure 10-2 indicates that I have configured the pentaho-big-data-plugin for PDI to use
Cloudera's CDH5 (cdh50). Because I have limited memory available on my CDH5 cluster, I run PDI on a
Windows machine and access Hadoop on Linux remotely.
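To confirm which configurations your own copy of PDI ships with, you can list the hadoop-configurations directory and check the current setting before editing plugin.properties. The cmd.exe session below is only a sketch; the install path and the subdirectory names other than cdh50 are assumptions and will vary with your PDI version:

C:\>cd data-integration\plugins\pentaho-big-data-plugin
C:\data-integration\plugins\pentaho-big-data-plugin>dir /b hadoop-configurations
cdh50
hdp21
mapr31
C:\data-integration\plugins\pentaho-big-data-plugin>findstr "active.hadoop.configuration" plugin.properties
active.hadoop.configuration=cdh50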
You also need to copy the Hadoop configuration files to the PDI plug-in's Hadoop configuration directory. From the Cloudera
CDH5 Manager home page, you select the YARN (MR2 Included) option, then select the Actions drop-down menu,
followed by Download Client Configuration. The zipped file that is downloaded contains the files core-site.xml,
hadoop-env.sh, hdfs-site.xml, hive-site.xml, mapred-site.xml, and yarn-site.xml. Because I am using the CDH5
(cdh50) configuration, I copy these files to the following PDI directory: data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh50.
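As a concrete example of that copy step, the following cmd.exe session copies the extracted client configuration files into the active cdh50 directory. The download folder name (yarn-clientconfig) and the PDI install path are assumptions; substitute the locations on your own machine:

C:\>cd C:\Users\mikejf12\Downloads\yarn-clientconfig
C:\Users\mikejf12\Downloads\yarn-clientconfig>copy *.xml "C:\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh50"
C:\Users\mikejf12\Downloads\yarn-clientconfig>copy hadoop-env.sh "C:\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\cdh50"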
Pentaho requires Sun/Oracle Java 1.7, which is available at https://java.com/en/download/index.jsp. Be sure
to download and install this on Windows; the following cmd.exe session output shows my Java installation:
C:\Users\mikejf12>java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
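If the PDI launch scripts pick up a different Java installation than the one you just installed, you can point them at the 1.7 runtime explicitly before starting Spoon. This sketch assumes a standard Oracle install path and that Spoon.bat honors the PENTAHO_JAVA_HOME variable (check set-pentaho-env.bat in your release); adjust the paths to suit your machine:

C:\Users\mikejf12>set PENTAHO_JAVA_HOME=C:\Program Files\Java\jre7
C:\Users\mikejf12>cd \data-integration
C:\data-integration>Spoon.bat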
 