wrappers given a set of mappings. It starts from the description of the abstract content of an external source and performs data acquisition, i.e., data extraction and transformation, so that the data conform to a single global schema. The description of the abstract content of an external source can also be used to manage sources whose data remain locally stored, which makes our techniques well integrated with the PICSEL mediator-based approach.
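To illustrate the kind of processing involved, the sketch below shows how mappings from source-specific XML paths to properties of a global schema might drive extraction and transformation. The element names, the mapping format, and the extract function are invented for illustration; this is not the actual PICSEL wrapper machinery.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings from paths in a source-specific XML format
# to properties of the global schema (all names invented).
MAPPINGS = {
    ".//hotelName": "Hotel.name",
    ".//town": "Hotel.locatedIn",
    ".//rate": "Hotel.price",
}

def extract(xml_text, mappings):
    """Extract data from one source document and rename it so that
    it conforms to the global schema."""
    root = ET.fromstring(xml_text)
    record = {}
    for source_path, global_property in mappings.items():
        node = root.find(source_path)
        if node is not None and node.text:
            record[global_property] = node.text.strip()
    return record

doc = ("<hotel><hotelName>Hotel du Parc</hotelName>"
       "<town>Paris</town><rate>120</rate></hotel>")
print(extract(doc, MAPPINGS))
# {'Hotel.name': 'Hotel du Parc', 'Hotel.locatedIn': 'Paris', 'Hotel.price': '120'}
```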
The transformation phase is then followed by a reconciliation step that handles several problems: possible mismatches between data referring to the same real-world object (different conventions and vocabularies can be used to represent and describe data); possible errors in the stored data, which are especially frequent when data are automatically extracted from the Web; and possible inconsistencies between the values that represent the properties of the same real-world object in different sources.
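A small invented example makes these problems concrete: the two records below describe the same hotel, yet they differ in conventions, contain an extraction error, and disagree on a property value.

```python
# Two references to the same real-world hotel, coming from two
# hypothetical sources (all values invented for illustration).
ref_from_source1 = {
    "name": "Hôtel du Parc",      # accented spelling
    "city": "Paris",
    "stars": 3,
}
ref_from_source2 = {
    "name": "Hotel du Parcc",     # extraction typo, different convention
    "city": "Paris (75)",         # different vocabulary for the city
    "stars": 2,                   # inconsistent property value
}
# Both records conform to the same global schema, yet a reconciliation
# step is still needed to decide that they denote the same object.
```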
This reconciliation step is essential because conformity to a single global schema does not by itself prevent variations between data descriptions. For this last step, we propose a knowledge-based and unsupervised approach that combines two methods: a logical one called L2R and a numerical one called N2R. The Logical method for Reference Reconciliation (L2R) is based on the translation of part of the schema semantics into first-order-logic Horn rules.
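As a much-simplified illustration of this idea, the sketch below applies two rules of the kind such a translation can produce: one derived from the disjointness of two classes, and one derived from a property declared inverse functional. The rules, schema, and data are ours, invented for illustration; they are not the actual L2R rule set.

```python
# Two rules in the spirit of a schema-semantics translation:
#   R1: DISJOINT(C1, C2) & C1(x) & C2(y)          => not_reconcile(x, y)
#   R2: INVERSE_FUNCTIONAL(p) & p(x, v) & p(y, v) => reconcile(x, y)

DISJOINT = {("Hotel", "Museum")}
INVERSE_FUNCTIONAL = {"phoneNumber"}

facts = {
    "r1": {"class": "Hotel",  "phoneNumber": "+33 1 40 00 00 00"},
    "r2": {"class": "Museum", "phoneNumber": "+33 1 44 00 00 00"},
    "r3": {"class": "Hotel",  "phoneNumber": "+33 1 40 00 00 00"},
}

def apply_rules(facts):
    reconcile, not_reconcile = set(), set()
    refs = sorted(facts)
    for i, x in enumerate(refs):
        for y in refs[i + 1:]:
            cx, cy = facts[x]["class"], facts[y]["class"]
            if (cx, cy) in DISJOINT or (cy, cx) in DISJOINT:
                not_reconcile.add((x, y))                  # rule R1
            for p in INVERSE_FUNCTIONAL:
                if facts[x].get(p) and facts[x].get(p) == facts[y].get(p):
                    reconcile.add((x, y))                  # rule R2
    return reconcile, not_reconcile

rec, not_rec = apply_rules(facts)
print(rec)      # {('r1', 'r3')}
print(not_rec)  # {('r1', 'r2'), ('r2', 'r3')} (in some order)
```

Because such rules only fire when the schema provides the relevant knowledge, the decisions they produce are sound but partial, which motivates the numerical method described next.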
To complement the partial results of L2R, we have designed a Numerical method for Reference Reconciliation (N2R). It exploits the L2R results and computes a similarity score for each pair of references.
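The sketch below conveys the flavor of such a numerical computation: a pair already decided by the logical step keeps a score of 1 or 0, and the remaining pairs get a weighted combination of attribute-value similarities. The weights, the difflib measure, and the record layout are our simplifications; a real implementation would be richer, for instance by letting the scores of related reference pairs influence one another.

```python
from difflib import SequenceMatcher

def attr_sim(a, b):
    """String similarity in [0, 1] (difflib's ratio, standing in for
    whatever attribute-level measure a real implementation uses)."""
    return SequenceMatcher(None, a, b).ratio()

def similarity(x, y, weights, logical_decisions):
    """Similarity score of a reference pair, reusing logical results:
    pairs already (non-)reconciled logically score 1.0 (or 0.0)."""
    pair = (x["id"], y["id"])
    if pair in logical_decisions:          # exploit the L2R-style results
        return 1.0 if logical_decisions[pair] else 0.0
    score = total = 0.0
    for attr, w in weights.items():
        if attr in x and attr in y:
            score += w * attr_sim(str(x[attr]), str(y[attr]))
            total += w
    return score / total if total else 0.0

x = {"id": "r1", "name": "Hôtel du Parc", "city": "Paris"}
y = {"id": "r2", "name": "Hotel du Parcc", "city": "Paris (75)"}
print(round(similarity(x, y, {"name": 0.7, "city": 0.3}, {}), 2))  # ~0.82
```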
This chapter is organized as follows. In section 2, we present closely related work and point out the novel features of the approach presented in this chapter. In section 3, we describe our approach. First, we define the data model used to represent the ontology and the data, the XML sources, and the automatically generated mappings used as inputs to the data extraction and transformation process. We then present the data extraction and transformation tasks and the two reconciliation techniques (L2R and N2R), followed by a summary of the results that we have obtained. In section 4, we briefly describe future research directions. Finally, section 5 concludes the chapter.
BACKGROUND
Many modern applications, such as data warehousing, global information systems, and electronic commerce, need to take existing data with a particular schema and reuse them in a different form. For a long time, data conversion was done in an ad hoc manner by developing non-reusable software. Later, language-based and declarative approaches provided tools for the specification and implementation of data and schema translations among heterogeneous data sources (Abiteboul et al., 1997; Cluet et al., 1998). Such rule-based approaches can deal with transformations made complex either by the diversity of data models or by the need for schema matching. In the former case, the approach helps to customize general-purpose translation tools. In the latter case, the idea is that the system automatically finds the matching between two schemas, based on a set of rules that specify how to perform the matching. All these works provide tools for designing data conversion programs, but they do not provide the ability to query external sources. More recently, the Clio system (Popa et al., 2002) has been proposed as a complement to and an extension of the language-based approaches. Given value correspondences that describe how to populate a single attribute of a target schema, this system discovers the mapping query needed to transform source data into target data. It produces SQL queries and provides users with data samples that help them understand the produced mappings.
Our work can also be compared to data integration systems, which provide mechanisms for uniformly querying sources through a target schema without materializing it in advance. These works adopt either the Global-As-View (GAV) or the Local-As-View (LAV) approach.