8.4.2 Overview of Kettle
We now give an overview of Kettle, a tool for designing and executing ETL
tasks. It is also known as Pentaho Data Integration and is a component of
the Pentaho Business Analytics suite.
The main components of Kettle are as follows:
• Transformations, which are logical tasks consisting of steps connected by
hops, both defined below. Transformations are essentially data flows, and their
purpose is to extract, transform, and load data; a sketch of executing one
programmatically is given after this list.
• Steps are the basic components of a transformation. A step performs a
specific task, such as reading data from a flat file, filtering rows, or writing
to a database. The steps available in Kettle are grouped according to their
function, such as input, output, scripting, and so on. Note that the steps
in a transformation run in parallel, each one in its own thread.
• Hops are data paths that connect steps to each other, allowing records to
pass from one step to another. Hops determine the flow of data through
the steps, although not necessarily the sequence in which they run.
• Jobs are workflows that orchestrate the individual pieces of functionality
implementing an entire ETL process. Jobs are composed of job entries, job
hops, and job settings.
• Job entries are the primary building blocks of a job and correspond to the
steps in data transformations.
• Job hops specify the execution order of job entries and the conditions under
which they are executed, based on the results of previous entries. Job hops
behave differently from hops used in a transformation.
• Job settings are the options that control the behavior of a job and the
method of logging a job's actions.
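As an illustration of how transformations are executed, the following sketch
loads a transformation definition from a file and runs it through Kettle's Java
API. The file name sales_load.ktr is a placeholder, and the sketch assumes the
Kettle libraries are available on the classpath.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                          // initialize the Kettle runtime
            TransMeta meta = new TransMeta("sales_load.ktr");  // hypothetical transformation file
            Trans trans = new Trans(meta);
            trans.execute(null);         // launches every step in its own thread
            trans.waitUntilFinished();   // block until all steps have completed
            if (trans.getErrors() > 0) {
                System.err.println("Transformation finished with errors");
            }
        }
    }

Note that execute returns as soon as the steps have been started; since the
steps run concurrently, waitUntilFinished is needed before the outcome can be
inspected.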
It is worth mentioning that loops are not allowed in transformations since
the field values that are passed from one step to another are dependent on the
previous steps, and as we said above, steps are executed in parallel. However,
loops are allowed in jobs since job entries are executed sequentially.
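To make this execution model concrete, the following self-contained sketch
mimics a three-step transformation: each step runs in its own thread, and each
hop is modeled as a bounded queue through which rows travel. This is an
illustrative analogy, not Kettle's actual implementation.

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class MiniPipeline {
        private static final String END = "<END>";  // marker signaling end of the row stream

        public static void main(String[] args) throws InterruptedException {
            // Hops: bounded queues connecting consecutive steps
            BlockingQueue<String> hop1 = new ArrayBlockingQueue<>(100);
            BlockingQueue<String> hop2 = new ArrayBlockingQueue<>(100);

            // Input step: emits rows of the form "key,value"
            Thread input = new Thread(() -> {
                try {
                    for (String row : List.of("a,1", "b,2", "c,3")) hop1.put(row);
                    hop1.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            // Filter step: keeps rows whose numeric value exceeds 1
            Thread filter = new Thread(() -> {
                try {
                    for (String row = hop1.take(); !row.equals(END); row = hop1.take())
                        if (Integer.parseInt(row.split(",")[1]) > 1) hop2.put(row);
                    hop2.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            // Output step: writes the surviving rows
            Thread output = new Thread(() -> {
                try {
                    for (String row = hop2.take(); !row.equals(END); row = hop2.take())
                        System.out.println(row);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            input.start(); filter.start(); output.start();
            input.join(); filter.join(); output.join();
        }
    }

Since all three threads run at once, the output step may already be writing a
row while the input step is still producing others. A cycle among the hops
would make a row's value depend on a step downstream of it, which is precisely
why transformations must be acyclic.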
Kettle is composed of the following components:
• Data Integration Server, which performs the actual data integration tasks.
Its primary functions are to execute jobs and transformations, to define and
manage security, to provide content management facilities to administer
jobs and transformations in collaborative development environments, and
to provide services for scheduling and monitoring activities.
• Spoon, a graphical user interface for designing jobs and transformations.
The transformations can be executed locally within Spoon or in the Data
Integration Server. Spoon provides a way to create complex ETL jobs
without having to read or write code. A job can also be launched
programmatically, as sketched after this list.
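A minimal sketch of launching a job programmatically, assuming the job is
stored in a local file (the name nightly_etl.kjb is hypothetical) and no
repository is used:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.core.Result;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            JobMeta jobMeta = new JobMeta("nightly_etl.kjb", null);  // null: no repository
            Job job = new Job(null, jobMeta);
            job.start();               // job entries are executed sequentially
            job.waitUntilFinished();
            Result result = job.getResult();
            if (result.getNrErrors() > 0) {
                System.err.println("Job finished with errors");
            }
        }
    }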