Data Extraction, Transformation and Integration Guided by an Ontology - Data Warehousing Design and Advanced Engineering Applications


In order to improve their efficiency, some recent methods exploit knowledge which is either learnt by using supervised algorithms or explicitly specified by a domain expert. For instance, in (Dey et al., 1998b; Dong et al., 2005), knowledge about the impact of the different attributes or relations is encoded in weights that are set by an expert or learnt on labelled data. However, these methods are time-consuming and depend on human expertise, either to label the training data or to specify declaratively the additional knowledge used for reference reconciliation. Both the L2R and N2R methods exploit the semantics of the schema or of the data, expressed by a set of constraints. They are unsupervised methods since no labelled data is needed by either L2R or N2R.

Most of the existing methods infer only reconciliation decisions. However, some methods infer non-reconciliation decisions for reducing the reconciliation space. This is the case for the so-called blocking methods introduced in (Newcombe, 1962) and used in recent approaches such as (Baxter et al., 2003).
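
As a rough illustration of how blocking reduces the reconciliation space, the sketch below groups references by a blocking key and generates candidate pairs only within each group; pairs that fall in different blocks are implicitly treated as non-reconciled. The record fields, the `blocking_key` function and the choice of key (the first three letters of the name) are hypothetical and are not taken from the cited works.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three letters of the lower-cased name."""
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Return the pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        # Only references inside the same block are compared later on.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Toy data, invented for this illustration.
records = [
    {"id": 1, "name": "Louvre Museum"},
    {"id": 2, "name": "Louvre"},
    {"id": 3, "name": "Orsay Museum"},
    {"id": 4, "name": "Orsay"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(pairs, f"({len(pairs)} of {all_pairs} possible pairs)")
```

With this toy key, only 2 of the 6 possible pairs remain to be compared. A key that is too coarse keeps too many pairs, while a key that is too strict may separate references that denote the same entity.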

THE PICSEL3 DATA EXTRACTION, TRANSFORMATION AND INTEGRATION APPROACH

In this section, we first define the data model used to represent the ontology and the data, the external XML sources and the mappings. In a second sub-section, we present the data extraction and transformation tasks and then the two reconciliation techniques (L2R and N2R), followed by a summary of the results that we have obtained by applying these methods to data sets related to scientific publications.

Data Model, XML Sources and Mappings

We first describe the data model used to represent the ontology O. This model is called RDFS+ because it extends RDFS with some OWL-DL primitives and SWRL rules, both being used to state constraints that enrich the semantics of the classes and properties declared in RDFS. Then we describe the XML sources we are interested in and the mappings that are automatically generated and then used as inputs of the data extraction and transformation process.

The RDFS+ Data Model

RDFS+ can be viewed as a fragment of the relational model (restricted to unary and binary relations) enriched with typing constraints, inclusion and exclusion between relations and functional dependencies.

The Schema and its Constraints

An RDFS schema consists of a set of classes (unary relations) organized in a taxonomy and a set of typed properties (binary relations). These properties can also be organized in a taxonomy of properties. Two kinds of properties can be distinguished in RDFS: the so-called relations, the domain and the range of which are classes, and the so-called attributes, the domain of which is a class and the range of which is a set of basic values (e.g. Integer, Date, Literal). For example, in the RDFS schema presented in Figure 2, we have a relation located having as domain the class CulturalPlace and as range the class Address. We also have an attribute name having as domain the class CulturalPlace and as range the data type Literal.

We allow the declaration of constraints expressed in OWL-DL or in SWRL in order to enrich the RDFS schema. The constraints that we consider are of the following types:

• Constraints of disjunction between classes: DISJOINT(C, D) is used to declare that the two classes C and D are disjoint, for example: DISJOINT(CulturalPlace, Artist).

• Constraints of functionality of properties: PF(P) is used to declare that the property P
