• Pan, a stand-alone command-line tool for executing transformations. Pan reads data from and writes data to various data sources and is also used to manipulate data.
• Kitchen, a stand-alone command-line tool for executing jobs. Jobs are usually scheduled to run in batch mode at regular intervals.
• Carte, a lightweight server for running jobs and transformations on a remote host. It provides execution capabilities similar to those of the Data Integration Server but does not provide scheduling, security, and content management facilities.
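These stand-alone tools are driven from the command line. As a rough illustration, the sketch below builds such invocations in Python; the file names and the `-file`/`-level` option syntax shown here are indicative only and may differ across Kettle versions and platforms.

```python
# Sketch: building command lines for Kettle's stand-alone tools.
# File names and option syntax are illustrative assumptions.

def pan_command(ktr_file, log_level="Basic"):
    """Command line to run a single transformation with Pan."""
    return ["./pan.sh", f"-file={ktr_file}", f"-level={log_level}"]

def kitchen_command(kjb_file, log_level="Basic"):
    """Command line to run a job with Kitchen (e.g., from a cron entry)."""
    return ["./kitchen.sh", f"-file={kjb_file}", f"-level={log_level}"]

print(pan_command("load_category.ktr"))
print(kitchen_command("northwind_etl.kjb"))
```

In a batch setting, the Kitchen invocation would typically be placed in a scheduler entry so the job runs at regular intervals, as described above.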
We illustrate these concepts in Sect. 8.6 when we study the implementation
of the Northwind ETL process in Kettle.
8.5 The Northwind ETL Process in Integration Services
In this section, we show an implementation in Integration Services of the
ETL process that loads the Northwind data warehouse. We compare this
implementation with the conceptual design presented in Sect. 8.3 and show
how to translate the constructs of the conceptual ETL language into the
equivalent ones in Integration Services. In our implementation, the Northwind
operational database and the Northwind data warehouse are located on an
SQL Server database.
Figure 8.26 shows the overall ETL process. It is composed of one sequence container task (the one with the blue arrow on the left) and eleven data flow
tasks. All of these tasks are connected by precedence constraints, represented
by green arrows. The reader can compare this representation with the one in Fig. 8.18. Note that gateways are not present, but the semantics of the
corresponding arrows is quite similar: no task can start until all precedent
tasks have finished.
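This "no task starts before all its predecessors finish" semantics is exactly a topological ordering over the precedence constraints. The sketch below models a hypothetical subset of the package's tasks this way; the task names are illustrative, not taken from the actual package.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical precedence constraints: each task maps to the set of
# tasks that must have finished before it may start.
constraints = {
    "Load Category": set(),
    "Load Time": set(),
    "Load Product": {"Load Category"},
    "Load Sales": {"Load Product", "Load Time"},
}

# static_order() yields the tasks so that every task appears
# after all of its predecessors.
order = list(TopologicalSorter(constraints).static_order())
print(order)
```

An engine such as Integration Services can additionally run independent tasks (here, Load Category and Load Time) in parallel, since neither constrains the other.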
Several data flow tasks are simple. For example, the task that loads the Category table is given in Fig. 8.27a (compare with Fig. 8.19), where the data flow tasks are an OLE DB Source task that reads the entire table from the operational database and an OLE DB Destination task that receives the output from the previous task and stores it in the data warehouse. Similar data flows are used for loading the Product, Shipper, and Employee tables.
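Such a source-to-destination data flow amounts to copying a table from one database to another. The sketch below imitates it with Python's `sqlite3` standing in for the two OLE DB connections; the column list is a simplified assumption, not the full Northwind schema.

```python
import sqlite3

# Stand-ins for the operational database and the data warehouse;
# in Integration Services these would be OLE DB connections to SQL Server.
src = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

src.execute("CREATE TABLE Category (CategoryID INTEGER, CategoryName TEXT)")
src.executemany("INSERT INTO Category VALUES (?, ?)",
                [(1, "Beverages"), (2, "Condiments")])

dw.execute("CREATE TABLE Category (CategoryID INTEGER, CategoryName TEXT)")

# OLE DB Source: read the entire table from the operational database.
rows = src.execute("SELECT CategoryID, CategoryName FROM Category").fetchall()
# OLE DB Destination: store the received rows in the data warehouse.
dw.executemany("INSERT INTO Category VALUES (?, ?)", rows)

print(dw.execute("SELECT COUNT(*) FROM Category").fetchone()[0])  # → 2
```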
Another straightforward task is the data flow that loads the Time dimension table, shown in Fig. 8.27b. After loading the source Excel file,
a data conversion transformation is needed to convert the data types from
the Excel file into the data types of the database. We can also see that this is
very similar to the conceptual specification depicted in Fig. 8.20, except that the addition of the surrogate key column is implicit in the Time Load task.
We further explain this next.
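The two ingredients of this data flow, converting the spreadsheet's values into database types and silently assigning a surrogate key on load, can be sketched as follows. The column names and the date format are hypothetical; an actual Time dimension would carry many more attributes.

```python
from datetime import datetime

# Hypothetical rows as read from the Excel source: dates arrive as text
# and must be converted to database types (the Data Conversion step).
excel_rows = [{"Date": "1996-07-04"}, {"Date": "1996-07-05"}]

time_dim = []
for key, row in enumerate(excel_rows, start=1):
    d = datetime.strptime(row["Date"], "%Y-%m-%d").date()  # type conversion
    time_dim.append({
        "TimeKey": key,  # surrogate key, added implicitly during the load
        "Date": d,
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
    })

print(time_dim[0])
```

The point of the sketch is that the surrogate key never appears in the source data: it is generated as a side effect of loading, which is why the conceptual specification shows it explicitly while the Time Load task does not.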