warehouse based on the same corporate data model set noted before. Third, distributed
data marts can be constructed to integrate different data marts via hub servers. Finally,
a multitier data warehouse is constructed where the enterprise warehouse is the sole
custodian of all warehouse data, which is then distributed to the various dependent
data marts.
4.1.6 Extraction, Transformation, and Loading
Data warehouse systems use back-end tools and utilities to populate and refresh their
data (Figure 4.1). These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse
format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
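The five functions above can be sketched as a small pipeline. This is only an illustrative toy under stated assumptions, not how any particular warehouse product implements them: the two in-memory sources (CRM_ROWS, POS_ROWS), the field names, and the "total amount per customer" summary are all hypothetical.

```python
# Hypothetical source data in two different "host" formats.
CRM_ROWS = [{"cust": " Alice ", "amt": "120.50"}, {"cust": "Bob", "amt": "n/a"}]
POS_ROWS = [("alice", 30), ("carol", 75)]


def extract():
    """Data extraction: gather rows from multiple heterogeneous sources."""
    for row in CRM_ROWS:
        yield {"source": "crm", "customer": row["cust"], "amount": row["amt"]}
    for name, amt in POS_ROWS:
        yield {"source": "pos", "customer": name, "amount": amt}


def clean(rows):
    """Data cleaning: detect bad amounts; drop rows that cannot be rectified."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # e.g. "n/a" cannot be repaired in this sketch
        yield {**row, "amount": amount}


def transform(rows):
    """Data transformation: convert to the (assumed) warehouse key format."""
    for row in rows:
        yield {**row, "customer": row["customer"].strip().lower()}


def load(rows, warehouse):
    """Load: consolidate and summarize per customer, then sort by key."""
    for row in rows:
        key = row["customer"]
        warehouse[key] = warehouse.get(key, 0.0) + row["amount"]
    return dict(sorted(warehouse.items()))


def refresh(warehouse, new_rows):
    """Refresh: propagate new source rows into a copy of the warehouse."""
    return load(transform(clean(new_rows)), dict(warehouse))


warehouse = load(transform(clean(extract())), {})
refreshed = refresh(
    warehouse, [{"source": "pos", "customer": "Carol", "amount": "25"}]
)
```

Each stage is a generator, so rows stream through the pipeline without intermediate copies. A real loader would also check integrity constraints and build indices and partitions, which this sketch omits.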
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse
systems usually provide a good set of data warehouse management tools.
Data cleaning and data transformation are important steps in improving the data
quality and, subsequently, the data mining results (see Chapter 3). Because we are mostly
interested in the aspects of data warehousing technology related to data mining, we will
not get into the details of the remaining tools, and we refer interested readers to
texts dedicated to data warehousing technology.
4.1.7 Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Figure 4.1 showed a metadata repository within the bottom
tier of the data warehousing architecture. Metadata are created for the data names
and definitions of the given warehouse. Additional metadata are created and captured
for timestamping any extracted data, the source of the extracted data, and missing fields
that have been added by data cleaning or integration processes.
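As a minimal sketch of the additional metadata just described, the following records the extraction timestamp, the source, and the fields that cleaning filled in. The class ExtractionMetadata, the function clean_with_metadata, and the default values are hypothetical names chosen for illustration and are not part of any standard warehouse API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExtractionMetadata:
    """Metadata captured alongside one extracted record (hypothetical layout)."""
    source: str                 # where the record was extracted from
    extracted_at: datetime      # timestamp of the extraction
    filled_fields: list = field(default_factory=list)  # fields added by cleaning


def clean_with_metadata(record, source, defaults):
    """Fill missing fields from defaults and record which ones were filled."""
    filled = [name for name in defaults if record.get(name) is None]
    cleaned = {**record, **{name: defaults[name] for name in filled}}
    meta = ExtractionMetadata(
        source=source,
        extracted_at=datetime.now(timezone.utc),
        filled_fields=filled,
    )
    return cleaned, meta


cleaned, meta = clean_with_metadata(
    {"customer": "alice", "region": None},
    source="crm",
    defaults={"region": "unknown"},
)
```

Keeping the metadata in a separate object, rather than inside the record, lets the repository store it independently of the warehouse data it describes.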
A metadata repository should contain the following:
A description of the data warehouse structure, which includes the warehouse schema,
view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
 