8.4.2 Overview of Kettle
We now give an overview of Kettle, a tool for designing and executing ETL
tasks. It is also known as Pentaho Data Integration and is a component of
the Pentaho Business Analytics suite.
The main components of Kettle are as follows:
• Transformations, which are logical tasks consisting of steps connected by
hops, both defined below. Transformations are essentially data flows, and their
purpose is to extract, transform, and load data; a sketch of executing one
programmatically is given after this list.
• Steps are the basic components of a transformation. A step performs a
specific task, such as reading data from a flat file, filtering rows, or writing
to a database. The steps available in Kettle are grouped according to their
function, such as input, output, scripting, and so on. Note that the steps
in a transformation run in parallel, each one in its own thread.
• Hops are data paths that connect steps to each other, allowing records to
pass from one step to another. Hops determine the flow of data through
the steps, although not necessarily the sequence in which they run.
• Jobs are workflows that orchestrate the individual pieces of functionality
implementing an entire ETL process. Jobs are composed of job entries, job
hops, and job settings.
• Job entries are the primary building blocks of a job and correspond to the
steps in data transformations.
• Job hops specify the execution order of job entries and the conditions under
which they are executed, based on the results of previous entries. Job hops
behave differently from hops used in a transformation.
• Job settings are the options that control the behavior of a job and the
method of logging a job's actions.
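As an illustration of how transformations are executed, the following sketch
loads a transformation definition from a file and runs it through Kettle's Java
API. The file name sales_load.ktr is a placeholder, and the sketch assumes the
Kettle libraries are available on the classpath.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                          // initialize the Kettle runtime
            TransMeta meta = new TransMeta("sales_load.ktr");  // hypothetical transformation file
            Trans trans = new Trans(meta);
            trans.execute(null);         // launches every step in its own thread
            trans.waitUntilFinished();   // block until all steps have completed
            if (trans.getErrors() > 0) {
                System.err.println("Transformation finished with errors");
            }
        }
    }

Note that execute returns as soon as the steps have been started; since the
steps run concurrently, waitUntilFinished is needed before the outcome can be
inspected.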
It is worth mentioning that loops are not allowed in transformations since
the field values that are passed from one step to another are dependent on the
previous steps, and as we said above, steps are executed in parallel. However,
loops are allowed in jobs since job entries are executed sequentially.
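To make this execution model concrete, the following self-contained sketch
mimics a three-step transformation: each step runs in its own thread, and each
hop is modeled as a bounded queue through which rows travel. This is an
illustrative analogy, not Kettle's actual implementation.

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class MiniPipeline {
        private static final String END = "<END>";  // marker signaling end of the row stream

        public static void main(String[] args) throws InterruptedException {
            // Hops: bounded queues connecting consecutive steps
            BlockingQueue<String> hop1 = new ArrayBlockingQueue<>(100);
            BlockingQueue<String> hop2 = new ArrayBlockingQueue<>(100);

            // Input step: emits rows of the form "key,value"
            Thread input = new Thread(() -> {
                try {
                    for (String row : List.of("a,1", "b,2", "c,3")) hop1.put(row);
                    hop1.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            // Filter step: keeps rows whose numeric value exceeds 1
            Thread filter = new Thread(() -> {
                try {
                    for (String row = hop1.take(); !row.equals(END); row = hop1.take())
                        if (Integer.parseInt(row.split(",")[1]) > 1) hop2.put(row);
                    hop2.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            // Output step: writes the surviving rows
            Thread output = new Thread(() -> {
                try {
                    for (String row = hop2.take(); !row.equals(END); row = hop2.take())
                        System.out.println(row);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            input.start(); filter.start(); output.start();
            input.join(); filter.join(); output.join();
        }
    }

Since all three threads run at once, the output step may already be writing a
row while the input step is still producing others. A cycle among the hops
would make a row's value depend on a step downstream of it, which is precisely
why transformations must be acyclic.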
Kettle is composed of the following components:
• Data Integration Server, which performs the actual data integration tasks.
Its primary functions are to execute jobs and transformations, to define and
manage security, to provide content management facilities to administer
jobs and transformations in collaborative development environments, and
to provide services for scheduling and monitoring activities.
• Spoon, a graphical user interface for designing jobs and transformations.
The transformations can be executed locally within Spoon or in the Data
Integration Server. Spoon provides a way to create complex ETL jobs
without having to read or write code. A job can also be launched
programmatically, as sketched after this list.
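A minimal sketch of launching a job programmatically, assuming the job is
stored in a local file (the name nightly_etl.kjb is hypothetical) and no
repository is used:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.core.Result;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            JobMeta jobMeta = new JobMeta("nightly_etl.kjb", null);  // null: no repository
            Job job = new Job(null, jobMeta);
            job.start();               // job entries are executed sequentially
            job.waitUntilFinished();
            Result result = job.getResult();
            if (result.getNrErrors() > 0) {
                System.err.println("Job finished with errors");
            }
        }
    }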