Data Extraction, Transformation and Integration Guided by an Ontology - Data Warehousing Design and Advanced Engineering Applications


The reasoning is applied to R ∪ F: the set of rules (put in clausal form) and the set of facts generated as explained before. It aims at inferring all unit facts of the form Reconcile(i,j), ¬Reconcile(i,j), SynVals(v1,v2) and ¬SynVals(v1,v2). Several resolution strategies have been proposed to reduce the number of resolution steps needed to obtain a proof (for more details about these strategies, see (Chang & Lee, 1997)). We have chosen unit resolution (Henschen & Wos, 1974), a strategy in which at least one of the two clauses involved in each resolution step is a unit clause, i.e. a clause reduced to a single literal. Unit resolution is complete for refutation in the case of Horn clauses without functions (Henschen & Wos, 1974). Furthermore, it is linear with respect to the size of the clause set (Forbus & de Kleer, 1993). The unit resolution algorithm that we have implemented computes the set of instantiated unit clauses that are contained in F or inferred by unit resolution on R ∪ F. Its termination is guaranteed because there are no function symbols in R ∪ F. Its completeness for deriving all the facts that are logically entailed has been established in (Saïs et al., 2009).
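The saturation described above can be illustrated with a minimal sketch. It assumes the clauses have already been instantiated (ground), represents a literal as a hypothetical (sign, atom) pair, and uses a naive fixpoint loop rather than the linear-time indexing a real implementation would need; none of these names come from the original system.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Literal = Tuple[bool, str]        # (is_positive, ground atom), hypothetical encoding
Clause = FrozenSet[Literal]


def negate(literal: Literal) -> Literal:
    """Return the complementary literal."""
    return (not literal[0], literal[1])


def unit_resolution(clauses: Iterable[Clause]) -> Set[Literal]:
    """Return every unit literal contained in or derivable from the clauses."""
    pending: List[Clause] = [frozenset(c) for c in clauses]
    units: Set[Literal] = set()
    changed = True
    while changed:
        changed = False
        remaining: List[Clause] = []
        for clause in pending:
            # Resolve against known unit facts: drop each literal whose
            # complement has already been derived.
            reduced = frozenset(l for l in clause if negate(l) not in units)
            if not reduced:
                raise ValueError("the clause set is inconsistent")
            if len(reduced) == 1:
                (literal,) = reduced
                if literal not in units:
                    units.add(literal)
                    changed = True
            else:
                remaining.append(reduced)
        pending = remaining
    return units


# Illustrative ground clauses (made-up identifiers, not taken from Example 1).
clauses = [
    frozenset({(True, "SynVals(a,b)")}),
    frozenset({(False, "SynVals(a,b)"), (True, "Reconcile(r1,r2)")}),
    frozenset({(False, "Reconcile(r1,r2)"), (False, "Reconcile(r1,r3)")}),
]
facts = unit_resolution(clauses)
```

On these three clauses the loop derives SynVals(a,b), Reconcile(r1,r2) and ¬Reconcile(r1,r3). Because the clauses contain no function symbols, the set of possible unit facts is finite, so the loop necessarily stops.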

This equation system is solved by an iterative method inspired by the Jacobi method (Golub & Van Loan, 1996), which converges quickly on linear equation systems. Here, however, the equation system is not linear, because the max function is used in the numerical translation of the functionality and inverse functionality axioms declared in the RDFS+ schema. Therefore, we had to prove the convergence of the iterative method for the resulting nonlinear equation system.

N2R can be applied alone or in combination with L2R. In the latter case, the non-reconciliations inferred by L2R are exploited to reduce the reconciliation space, i.e., the size of the equation system to be solved by N2R. In addition, the reconciliations and the synonymies or non-synonymies inferred by L2R are used to set the values of the corresponding constants or variables in the equations.
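One way to picture this combination is the sketch below. The function name, the pair representation, and the choice of fixing a reconciled pair to similarity 1.0 are assumptions made for illustration; the text only states that L2R results prune the system and set some values.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # a pair of reference identifiers (hypothetical encoding)


def prepare_n2r_input(
    candidate_pairs: Iterable[Pair],
    reconciled: Set[Pair],
    not_reconciled: Set[Pair],
) -> Tuple[Dict[Pair, float], List[Pair]]:
    """Split candidate pairs into fixed values and remaining unknowns."""
    fixed: Dict[Pair, float] = {}
    variables: List[Pair] = []
    for pair in candidate_pairs:
        if pair in not_reconciled:
            # Pruned: no variable and no equation is created for this pair.
            continue
        if pair in reconciled:
            # Assumption: a pair reconciled by L2R gets the maximal score.
            fixed[pair] = 1.0
        else:
            variables.append(pair)
    return fixed, variables


# Made-up identifiers, used only to exercise the function.
fixed, variables = prepare_n2r_input(
    candidate_pairs=[("a1", "b1"), ("a1", "b2"), ("a2", "b2")],
    reconciled={("a1", "b1")},
    not_reconciled={("a1", "b2")},
)
```

Only the pairs left in `variables` would then need an equation, which is how the size of the system shrinks.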

We first use a simple example to illustrate how the equation system is built. Then, we describe how the similarity dependencies between references are modeled in an equation system, and we provide the iterative method for solving it.

N2R: A Numerical Method for Reference Reconciliation

Example 2

Let us consider the data descriptions of Example 1 and the reference pairs <S1_r607, S2_r208>, <S_d1e5, S2_l6f2>, <S1_p112, S2_p222> and <S1_p112, S2_p232>.

The similarity score Sim_r(ref, ref') between the references ref and ref' of each of those pairs is modeled by a variable: x1 models Sim_r(S1_r607, S2_r208), x2 models Sim_r(S1_p112, S2_p222), x3 models Sim_r(S1_p112, S2_p232), and x4 models Sim_r(S_d1e5, S2_l6f2).

We obtain the following equations, which model the dependencies between those variables:

x1 = max(0.68, x2, x3, x4/4)
x2 = max(0.1, x1/2)
x3 = max(0.9, x1/2)
x4 = max(0.42, x1)
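A Jacobi-style iteration on these four equations can be sketched as follows. The starting point (all variables at 0), the stopping threshold `epsilon` and the iteration cap are assumptions chosen for illustration; the function and variable names are hypothetical.

```python
from typing import Callable, Dict

Values = Dict[str, float]
Equations = Dict[str, Callable[[Values], float]]

# The four equations of Example 2, copied from the text.
EXAMPLE_2: Equations = {
    "x1": lambda x: max(0.68, x["x2"], x["x3"], x["x4"] / 4),
    "x2": lambda x: max(0.1, x["x1"] / 2),
    "x3": lambda x: max(0.9, x["x1"] / 2),
    "x4": lambda x: max(0.42, x["x1"]),
}


def solve_jacobi_style(
    equations: Equations,
    epsilon: float = 1e-6,       # assumed stopping threshold
    max_iterations: int = 1000,  # assumed safety cap
) -> Values:
    """Iterate x(k+1) = F(x(k)) until no variable moves by epsilon or more."""
    current: Values = {name: 0.0 for name in equations}  # assumed start
    for _ in range(max_iterations):
        # Jacobi-style step: every update reads only the previous iterate.
        updated = {name: f(current) for name, f in equations.items()}
        if max(abs(updated[n] - current[n]) for n in equations) < epsilon:
            return updated
        current = updated
    raise RuntimeError("no convergence within max_iterations")


solution = solve_jacobi_style(EXAMPLE_2)
```

Starting from zero, the iterates are (0.68, 0.1, 0.9, 0.42), then (0.9, 0.34, 0.9, 0.68), then (0.9, 0.45, 0.9, 0.9), after which nothing changes, so the sketch returns x1 = 0.9, x2 = 0.45, x3 = 0.9 and x4 = 0.9.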

In this equation system, the first equation expresses that the variable x1 depends strongly and equally on the variables x2 and x3, and also, more weakly, on x4 (through the term x4/4).

N2R has two main distinguishing characteristics. First, it is fully unsupervised: it does not require any training phase on manually labeled data to set up coefficients or parameters. Second, it is based on equations that model the influence between similarities. In the equations, each variable represents the (unknown) similarity between two references, while the similarities between attribute values are constants computed with standard similarity measures on strings or on sets of strings. The functions modeling the influence between similarities combine maximum and average functions, so that the functionality and inverse functionality constraints declared in the RDFS+ schema are taken into account appropriately.
