Designing Data Marts from XML and Relational Data Sources - Data Warehousing Design and Advanced Engineering Applications

step is trivial since it only extracts the database schema (table names, column names and types, and key constraints) from the repository of the relational DBMS. For the XML documents, on the other hand, it transforms their DTD into a relational schema (i.e., a set of relational tables). To do so, the DTD is first simplified to eliminate any redundancies. Then, the simplified DTD (denoted DTD_s) is reorganized into a set of linked trees that we call transition trees. In addition, since a DTD is poor in the typing information required to identify multidimensional concepts (measures, dimensional attributes, ...), this step scans sample XML documents to extract richer typing information; the extracted types are assigned to the attributes and packed data elements in the transition trees. Based on the existing links among the typed transition trees, the latter are transformed into a relational schema. This step concludes with a schema integration phase that merges the two source schemas to produce one semantically coherent schema; it applies existing proposals for relational database schema integration, cf. (Bright, Hurson, & Pakzad, 1994), (Sheth & Larson, 1990), (Ceri & Widom, 1993), (Hull, 1997) and (Zhang & Yang, 2008).
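The scan of sample XML documents for typing information can be illustrated with a minimal sketch. It is not the authors' algorithm: the type names, the widening rule in merge_types, the date format, and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def infer_type(value):
    """Guess a coarse type for one text value (hypothetical rule set)."""
    text = value.strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "string"


def merge_types(old, new):
    """Combine the types seen so far for one attribute or element."""
    if old is None or old == new:
        return new
    if {old, new} == {"integer", "float"}:
        return "float"  # widen integer to float
    return "string"  # fall back to the most general type


def scan_types(xml_text):
    """Return a map from attribute/leaf-element name to its inferred type."""
    types = {}
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            key = f"{elem.tag}/@{name}"
            types[key] = merge_types(types.get(key), infer_type(value))
        # Only leaf elements that carry text are typed here.
        if len(elem) == 0 and elem.text and elem.text.strip():
            types[elem.tag] = merge_types(types.get(elem.tag), infer_type(elem.text))
    return types


# Hypothetical sample document.
SAMPLE = """<orders>
  <order id="1" date="2008-03-01"><amount>19.90</amount><customer>Ann</customer></order>
  <order id="2" date="2008-03-02"><amount>5</amount><customer>Bob</customer></order>
</orders>"""

inferred = scan_types(SAMPLE)
```

In this sketch a numeric type such as the one inferred for amount would mark the element as a candidate measure, which a DTD alone (where content is plain character data) could not reveal.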

Once the data source pretreatment produces the integrated relational schema, the design continues with the relation classification step. This step performs a reverse engineering task by examining the structure of the relations in the source schemas obtained from the first step. It automatically determines the conceptual class of each relation: a relation conceptually models either a relationship or an entity. This classification optimizes the automatic fact and dimension identification and improves its results.
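To make the idea of structural classification concrete, here is a small sketch based on a common reverse engineering heuristic, which is an assumption here and not necessarily the rule used by the method: a relation whose primary key consists entirely of foreign keys referencing at least two relations is treated as a relationship, and any other relation as an entity. The Relation class and the table names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """Minimal description of a relation: its name and key constraints."""
    name: str
    primary_key: tuple
    # Maps a foreign key column to the relation it references.
    foreign_keys: dict = field(default_factory=dict)


def classify(rel):
    """Classify a relation as 'relationship' or 'entity' (assumed heuristic)."""
    pk = set(rel.primary_key)
    referenced = {rel.foreign_keys[c] for c in pk if c in rel.foreign_keys}
    # Every primary key column is a foreign key, and together they
    # reference at least two relations.
    if pk and pk <= set(rel.foreign_keys) and len(referenced) >= 2:
        return "relationship"
    return "entity"


order_line = Relation(
    "order_line",
    ("order_id", "product_id"),
    {"order_id": "orders", "product_id": "product"},
)
product = Relation("product", ("product_id",))
```

With these definitions, classify(order_line) yields "relationship" and classify(product) yields "entity"; only the key constraints are inspected, never the data or the domain.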

The third step of our design method (data mart schema construction) extracts the multidimensional concepts (facts and their measures, dimensions and their attributes organized into hierarchies) from the classified relations and produces star models. To automate this step, we define a set of extraction rules for each multidimensional concept. Our rules are independent of the semantics of the data sources and of their domain. Instead, they rely on the structural semantics of the relations, which is mainly conveyed through the key constraints (primary and foreign keys). In addition, our rules have the merit of keeping track of the origin of each multidimensional concept in the generated data mart schemas. This traceability is fundamental when defining the ETL processes.
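The following sketch illustrates key-driven extraction with traceability. The two rules it applies are simplifying assumptions, not the extraction rules of the method: numeric non-key columns become candidate measures, and each foreign key yields a candidate dimension. The function build_star and all table and column names are hypothetical.

```python
def build_star(relation_name, columns, primary_key, foreign_keys):
    """Derive a candidate star schema from one relation.

    columns: maps column name to a coarse type.
    foreign_keys: maps foreign key column to the referenced relation.
    Every extracted concept records its origin in the source schema.
    """
    key_columns = set(primary_key) | set(foreign_keys)
    measures = [
        {"name": col, "origin": f"{relation_name}.{col}"}
        for col, col_type in columns.items()
        if col not in key_columns and col_type in ("integer", "float")
    ]
    dimensions = [
        {"name": target, "origin": f"{relation_name}.{col} -> {target}"}
        for col, target in foreign_keys.items()
    ]
    return {"fact": relation_name, "measures": measures, "dimensions": dimensions}


star = build_star(
    "order_line",
    {"order_id": "integer", "product_id": "integer",
     "quantity": "integer", "note": "string"},
    ("order_id", "product_id"),
    {"order_id": "orders", "product_id": "product"},
)
```

The origin entries are what an ETL designer would later consult to know which source column feeds each measure or dimension.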

Finally, the decision makers/designers are presented with a set of potential data mart schemas that they can adjust to meet their particular analytical requirements. In this final step (data mart adaptation), the decision makers/designers can add derived data and remove and/or rename data mart (DM) schema elements. The application of these adaptation operations is constrained to ensure that the resulting schemas are well-formed (Schneider, 2003), (Salem, Ghozzi, & Ben-Abdallah, 2008); e.g., a fact must have at least two dimensions.
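A constrained adaptation operation can be sketched as follows. Only the "at least two dimensions" constraint comes from the text; the schema representation, the function remove_dimension, and the exception name are assumptions for illustration.

```python
class WellFormednessError(ValueError):
    """Raised when an adaptation would break a well-formedness constraint."""


def remove_dimension(schema, dimension):
    """Return a copy of the schema without the given dimension.

    The operation is rejected if the fact would be left with fewer
    than two dimensions.
    """
    if dimension not in schema["dimensions"]:
        raise KeyError(dimension)
    remaining = [d for d in schema["dimensions"] if d != dimension]
    if len(remaining) < 2:
        raise WellFormednessError(
            f"fact {schema['fact']!r} must keep at least two dimensions"
        )
    return {**schema, "dimensions": remaining}


# Hypothetical data mart schema.
sales = {"fact": "sales", "dimensions": ["date", "product", "store"]}
smaller = remove_dimension(sales, "store")
```

Returning a new schema rather than mutating the input keeps the originally generated schema available, so a rejected or regretted adaptation never damages it.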

Before explaining the above four steps in detail, in the remainder of this section we give an overview of the concepts of XML structures and the relational model.

Basic XML Structural Concepts

An XML document has two types of information: the document structure and the data content; XML provides a means for separating one from the other in the electronic document. The document structure is given by matching pairs of opening and closing tags (each pair is called an element), and the data content is given by the information between matching tags. In addition, an element can have attributes whose values are assigned in the opening tag of the element.
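As a brief illustration of these notions, the sketch below parses a tiny document with Python's standard xml.etree.ElementTree module and lists, for each element, its tag, its attributes, and its data content. The document and the helper describe are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one attribute on <book>, data content in <title>.
DOC = '<book isbn="123"><title>Data Marts</title></book>'


def describe(xml_text):
    """List (tag, attributes, data content) for every element."""
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, dict(elem.attrib), (elem.text or "").strip())
        for elem in root.iter()
    ]


parts = describe(DOC)
```

Here the tags book and title form the document structure, isbn is an attribute assigned in the opening tag of book, and "Data Marts" is the data content between the matching title tags.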

To define the structure of a set of XML documents, a DTD document can be used. A DTD is a context-free grammar specifying all allowable elements, their attributes, and the element nesting structure. Given a DTD, it can be verified whether an XML document conforms to/
