Chapter 10
ETL with Hadoop
Given that Hadoop-based MapReduce programming is a relatively new skill, there is likely to be a shortage of highly
skilled staff for some time, and those skills will come at a premium price. ETL (extract, transform, and load) tools,
like Pentaho and Talend, offer a visual, component-based method of creating MapReduce jobs, allowing ETL chains
to be created and manipulated as visual objects. Such tools give staff a simpler and quicker way to approach
MapReduce programming. I'm not suggesting that they are a replacement for Java- or Pig-based code, but as an entry point
they offer a great deal of predefined functionality that can be combined so that complex ETL chains can be created and
scheduled. This chapter examines these two tools from installation to use, and along the way, I offer some
resolutions for common problems and errors you might encounter.
Pentaho Data Integrator
In this first half of the chapter, I explain how to source and install the Pentaho Data Integration (PDI) application.
Offering tools to analyze, visualize, explore, report, and predict from a single platform, PDI can work as a stand-alone
tool or as part of the Pentaho Business Analytics suite. Pentaho offers enhanced functionality, features, and
professional support for PDI; the open-source version is called Kettle. PDI is downloaded as a generic zipped package
that can be installed on either Windows or Linux. Here's how to install PDI and use it with Hadoop.
Note
For complete details on PDI, see the company's website at www.pentaho.com/product/data-integration .
Installing Pentaho
You can download the installation package for the Pentaho Data Integrator (PDI) from the following URL:
http://sourceforge.net/projects/pentaho/files/Data%20Integration/5.1/pdi-ce-5.1.0.0-752.zip/
download .
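If you prefer to fetch the package from the command line, the download can be scripted. This is a minimal sketch, assuming `curl` is available; the flags and the local filename are my choices for illustration, not part of the Pentaho documentation:

```shell
# PDI 5.1 community-edition archive on SourceForge (URL from the text above).
PDI_URL="http://sourceforge.net/projects/pentaho/files/Data%20Integration/5.1/pdi-ce-5.1.0.0-752.zip/download"
PDI_ZIP="pdi-ce-5.1.0.0-752.zip"

# Skip the (580 MB) download if the archive is already present.
# -L follows SourceForge's redirect to a mirror; -f fails cleanly on HTTP errors.
if [ ! -f "$PDI_ZIP" ]; then
    curl -f -L -o "$PDI_ZIP" "$PDI_URL" || echo "Download failed; check network access."
fi
```

The same URL works in a browser, which simply triggers the mirror redirect for you.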
The installation package can be used for either Linux or Windows. The zipped file is 580 MB, so it takes
quite a while to download. By way of example, after I downloaded and extracted the package, I installed the Windows
package on my C: drive, as shown in Figure 10-1 . As you can see in Figure 10-1 , the software installs into a directory
called “data-integration.” Note the directory structure, the start-up scripts, and the plug-ins directory, which are
marked with red boxes. On Windows, you would start the application using the Spoon.bat script; on Linux, you would
use the spoon.sh script.
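On Linux, the extract-and-launch steps can be sketched as follows. The archive location and install directory are assumptions for illustration; the zip unpacks into a top-level "data-integration" folder, as Figure 10-1 shows for Windows:

```shell
PDI_ZIP="pdi-ce-5.1.0.0-752.zip"
PDI_HOME="$HOME/data-integration"

# Extract the archive (if it has been downloaded) into the home directory.
if [ -f "$PDI_ZIP" ]; then
    unzip -q "$PDI_ZIP" -d "$HOME"
fi

# Choose the start-up script by platform: spoon.sh on Linux, Spoon.bat on Windows.
case "$(uname -s)" in
    Linux*|Darwin*) SPOON="$PDI_HOME/spoon.sh" ;;
    *)              SPOON="$PDI_HOME/Spoon.bat" ;;
esac
echo "Start PDI with: $SPOON"
```

Spoon is a graphical application, so it needs a desktop session (or X forwarding) to run; launching the script from a plain terminal on a headless server will fail.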