step is trivial since it only extracts the database
schema (table names, column names and types,
and key constraints) from the repository of the
relational DBMS. For the XML documents, on the
other hand, it transforms their DTD into a
relational schema (i.e., a set of relational tables).
To do so, the DTD is first simplified to eliminate any
redundancies. Then, the simplified DTD (denoted
DTD_s) is reorganized into a set of linked trees
that we call transition trees. In addition, since a
DTD is poor in typing information required for
the identification of multidimensional concepts
(measures, dimensional attributes, etc.), this step
scans sample XML documents in order to extract
richer typing information; the extracted types are
assigned to the attributes and packed data elements
in the transition trees. Based on the existing links
among the typed transition trees, the trees are then
transformed into a relational schema. This step
is concluded by a schema integration phase that
merges the two source schemas to produce one
semantically coherent schema; it applies exist-
ing propositions for relational database schema
integration, cf. (Bright, Hurson, & Pakzad, 1994),
(Sheth & Larson, 1990), (Ceri & Widom, 1993),
(Hull, 1997), and (Zhang & Yang, 2008).
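
The scan of sample XML documents for richer typing information can be illustrated with a minimal Python sketch. It is not the authors' procedure: the candidate types (integer, real, date, string), the function names, and the sample order documents are all assumptions made for illustration. Each attribute and each text-only element is given the narrowest type that fits every sampled value.

```python
import xml.etree.ElementTree as ET
from datetime import date


def infer_type(values):
    """Return the narrowest assumed type name that fits every sample value."""
    def all_match(parse):
        try:
            for value in values:
                parse(value.strip())
            return True
        except ValueError:
            return False

    if all_match(int):
        return "integer"
    if all_match(float):
        return "real"
    if all_match(date.fromisoformat):
        return "date"
    return "string"


def collect_types(xml_documents):
    """Map attributes ('elem/@attr') and text-only elements ('elem') to a type."""
    samples = {}
    for doc in xml_documents:
        for elem in ET.fromstring(doc).iter():
            for name, value in elem.attrib.items():
                samples.setdefault(f"{elem.tag}/@{name}", []).append(value)
            # Only leaf elements with non-blank text are treated as data content.
            if len(elem) == 0 and elem.text and elem.text.strip():
                samples.setdefault(elem.tag, []).append(elem.text)
    return {path: infer_type(vals) for path, vals in samples.items()}


# Hypothetical sample documents, invented for this sketch.
docs = [
    '<order id="1"><date>2008-03-14</date><amount>19.90</amount>'
    '<city>Sfax</city></order>',
    '<order id="2"><date>2008-04-02</date><amount>5</amount>'
    '<city>Tunis</city></order>',
]
types = collect_types(docs)
```

In this sketch a numeric type such as "real" would flag amount as a candidate measure later on, whereas a DTD alone would only declare it as character data.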
Once the data source pretreatment produces the
integrated relational schema, the design continues
with the relation classification step. The latter
performs a reverse engineering task by examining
the structure of the relations in the source
schemas obtained from the first step. It automatically
determines the conceptual class of each relation:
conceptually, a relation models either a
relationship or an entity. This classification optimizes the
automatic fact and dimension identification and
improves its results.
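
A classification driven by key structure could look like the following sketch. The rule used here is an assumption, not the rule set of the method: a relation whose primary key consists only of foreign-key columns is taken to model a relationship, and any other relation an entity. The Relation class and the sample relations are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    name: str
    primary_key: set
    # Maps a foreign-key column to the relation it references.
    foreign_keys: dict = field(default_factory=dict)


def classify(relation):
    """Assumed heuristic: a primary key built only from foreign keys
    signals a relationship; anything else is treated as an entity."""
    fk_columns = set(relation.foreign_keys)
    if relation.primary_key and relation.primary_key <= fk_columns:
        return "relationship"
    return "entity"


# Hypothetical relations, invented for this sketch.
customer = Relation("customer", {"id"})
product = Relation("product", {"id"})
order_line = Relation(
    "order_line",
    {"customer_id", "product_id"},
    {"customer_id": "customer", "product_id": "product"},
)

classes = {r.name: classify(r) for r in (customer, product, order_line)}
```

This simplification would also label a subtype relation (a single-column primary key that is itself a foreign key) as a relationship, so a fuller rule set would need additional cases.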
The third step of our design method (data mart
schema construction) extracts the multidimensional
concepts (facts and their measures, dimensions
and attributes organized into hierarchies)
from the classified relations and produces star
models. To automate this step, we define for each
multidimensional concept a set of extraction rules.
Our rules are independent of the semantics of the
data sources and their domain. They rather rely on
the structural semantics of the relations, which is
mainly disseminated through the key constraints
(primary and foreign keys). In addition, our rules
have the merit of keeping track of the origin of each
multidimensional concept in the generated data
mart schemas. This traceability is fundamental
during the definition of ETL processes.
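
To make the idea of key-driven extraction with traceability concrete, here is a hedged sketch. The two rules it applies are assumptions chosen for illustration and are not the extraction rules of the method: numeric non-key columns of a relation become candidate measures, and each relation referenced through a foreign key becomes a candidate dimension. Every extracted concept records its origin in the source schema.

```python
NUMERIC_TYPES = {"integer", "real"}  # assumed type names


def extract_star(relation):
    """Build a candidate star model from one relation.

    `relation` is a dict with the keys 'name', 'columns' (column -> type),
    'primary_key' (set of columns) and 'foreign_keys' (column -> relation).
    """
    key_columns = set(relation["primary_key"]) | set(relation["foreign_keys"])
    # Assumed rule 1: numeric non-key columns are candidate measures.
    measures = [
        {"name": col, "origin": f"{relation['name']}.{col}"}
        for col, col_type in relation["columns"].items()
        if col not in key_columns and col_type in NUMERIC_TYPES
    ]
    # Assumed rule 2: each referenced relation is a candidate dimension.
    dimensions = [
        {"name": target, "origin": f"{relation['name']}.{col} -> {target}"}
        for col, target in relation["foreign_keys"].items()
    ]
    return {
        "fact": relation["name"],
        "measures": measures,
        "dimensions": dimensions,
    }


# Hypothetical source relation, invented for this sketch.
sale = {
    "name": "sale",
    "columns": {
        "customer_id": "integer",
        "product_id": "integer",
        "quantity": "integer",
        "amount": "real",
        "comment": "string",
    },
    "primary_key": {"customer_id", "product_id"},
    "foreign_keys": {"customer_id": "customer", "product_id": "product"},
}
star = extract_star(sale)
```

The origin strings are what an ETL designer would consult to know which source column feeds each measure or dimension.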
Finally, the decision makers/designers are
presented with a set of potential data mart
schemas that they can adjust to meet their
particular analytical requirements. In this final step
(data mart adaptation), the decision makers/
designers can add derived data and remove and/or
rename DM schema elements. The application
of these adaptation operations is constrained
to ensure that the resulting schemas are
well-formed (Schneider, 2003; Salem, Ghozzi, &
Ben-Abdallah, 2008); e.g., a fact must have at
least two dimensions.
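
A constrained adaptation operation can be sketched as follows. Only the constraint that a fact keeps at least two dimensions comes from the text; the dictionary layout of a star schema, the function name, and the error class are assumptions made for this example.

```python
class AdaptationError(ValueError):
    """Raised when an adaptation would leave the schema ill-formed."""


def remove_dimension(star, dimension):
    """Return a copy of `star` without `dimension`, or refuse the removal."""
    if dimension not in star["dimensions"]:
        raise AdaptationError(f"unknown dimension: {dimension}")
    remaining = [d for d in star["dimensions"] if d != dimension]
    # Well-formedness constraint from the text: at least two dimensions.
    if len(remaining) < 2:
        raise AdaptationError("a fact must have at least two dimensions")
    return {**star, "dimensions": remaining}


# Hypothetical star schema, invented for this sketch.
star = {"fact": "sale", "dimensions": ["customer", "product", "date"]}
adapted = remove_dimension(star, "date")

try:
    remove_dimension(adapted, "product")
    rejected = False
except AdaptationError:
    rejected = True
```

Returning a new dictionary instead of mutating the input keeps the originally generated schema available, which is convenient when a designer wants to undo an adaptation.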
Before explaining the above four steps in detail,
we give, in the remainder of this section, an
overview of the concepts of XML structures and the
relational model.
Basic XML Structural Concepts
An XML document carries two types of information:
the document structure and the data content;
XML provides a means of separating one from
the other in the electronic document. The document
structure is given by matching pairs of opening
and closing tags (each pair called an element), and
the data content is given by the information
between matching tags. In addition, an element can
have attributes whose values are assigned in the
opening tag of the element.
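
The distinction between structure, data content, and attributes can be seen by parsing a small document with Python's standard xml.etree.ElementTree module. The book document below, including its isbn value, is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one element with an attribute and two children.
document = (
    '<book isbn="12345">'
    "<title>Data Marts</title>"
    "<year>2008</year>"
    "</book>"
)

root = ET.fromstring(document)

# Document structure: the element names, in document order.
structure = [elem.tag for elem in root.iter()]

# Data content: the text found between matching tags of the children.
content = {child.tag: child.text for child in root}

# Attributes: name/value pairs assigned in the opening tag.
attributes = dict(root.attrib)
```

Here structure holds the element names book, title, and year, content maps title and year to their text, and attributes holds the isbn value taken from the opening tag of book.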
To define the structure of a set of XML
documents, a DTD can be used. A DTD is
a context-free grammar specifying all allowable
elements, their attributes, and the element
nesting structure. Given one DTD, it can be verified
whether an XML document is conforming to/