Modeling Uncertain Schema Matching - Uncertain Schema Matching


highest value in that matrix. The two boxes at the bottom of Figure 3.1 were generated using a

beta distribution. According to Ross [1997]: “The beta distribution can be used to model a random phenomenon whose set of possible values is in some finite interval [c, d]—which, by letting c denote the origin and taking d − c as a unit measurement, can be transformed into the interval [0, 1].” A

beta distribution has two tuning parameters, a and b. To obtain a density function whose mass is concentrated on the left, toward 0 (as in the case of incorrect attribute correspondences, bottom left in Figure 3.1), we require that b > a. For a density function whose mass is concentrated on the right, toward 1 (as in the case of correct attribute correspondences, bottom right), one needs to set a > b.
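The effect of the two parameters can be checked numerically. The following is a minimal sketch using only the Python standard library; the parameter values (2, 5) and (5, 2) and the function names are illustrative assumptions, not the values used to produce Figure 3.1.

```python
import math

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at a point x strictly inside (0, 1)."""
    # Normalizing constant 1 / B(a, b), computed in log space for stability.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_mean(a, b):
    """Mean of Beta(a, b); below 0.5 when b > a, above 0.5 when a > b."""
    return a / (a + b)

def rescale(value, c, d):
    """Map a value from the finite interval [c, d] onto [0, 1]."""
    return (value - c) / (d - c)

# Hypothetical parameter choices for the two cases discussed in the text.
incorrect = (2.0, 5.0)  # b > a: mass concentrated toward 0
correct = (5.0, 2.0)    # a > b: mass concentrated toward 1

for label, (a, b) in (("incorrect", incorrect), ("correct", correct)):
    print(label,
          "mean:", round(beta_mean(a, b), 3),
          "pdf(0.2):", round(beta_pdf(0.2, a, b), 3),
          "pdf(0.8):", round(beta_pdf(0.8, a, b), 3))
```

Swapping a and b mirrors the density around 0.5, which is why the same family can describe both the incorrect and the correct correspondences.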

Going back to the semantics of data models, we note that schema matchers often use data model semantics when determining the similarity between attributes. For example, XML structure has been used in Cupid [Madhavan et al., 2001] to support or dispute linguistic similarities. Also, the similarity flooding algorithm [Melnik et al., 2002] uses structural links between attributes to update linguistic similarities. However, once this similarity has been determined and recorded in the similarity matrix, the original semantics from which it was derived are no longer needed. Therefore, the matrix representation, as given above, is sufficient to represent the uncertainty involved in the matching process.

Similarity matrices have been used in the literature mainly as a convenient representation model, rather than as a formal model used for reasoning, with two exceptions. Do and Rahm [2002] propose a cube to represent an ensemble of similarity values, transformed into a matrix by aggregating the similarity values of each attribute matching across ensemble members. Domshlak et al. [2007] have taken this process one step further and proposed the use of the matrix abstraction to perform local and global aggregations as matrix-to-constant and cube-to-matrix functions (see Chapter 4).
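To make the two function shapes concrete, here is a minimal sketch of a cube-to-matrix and a matrix-to-constant aggregation. The use of a plain average, the function names, and the two toy matchers are assumptions made for illustration; the aggregators actually proposed in the literature are the subject of Chapter 4.

```python
def cube_to_matrix(cube, aggregate):
    """Collapse an ensemble (list) of similarity matrices into one matrix, cell by cell."""
    rows, cols = len(cube[0]), len(cube[0][0])
    return [[aggregate([matrix[i][j] for matrix in cube]) for j in range(cols)]
            for i in range(rows)]

def matrix_to_constant(matrix, matching, aggregate):
    """Collapse the matrix entries selected by a matching into a single score."""
    return aggregate([matrix[i][j] for (i, j) in sorted(matching)])

def average(values):
    """Illustrative aggregator; any function from a list of numbers to a number fits."""
    return sum(values) / len(values)

# Two hypothetical matchers scoring the same 2 x 2 attribute grid.
name_matcher = [[0.9, 0.1], [0.2, 0.6]]
type_matcher = [[0.7, 0.3], [0.0, 0.8]]
cube = [name_matcher, type_matcher]

combined = cube_to_matrix(cube, average)
# Score the matching that pairs attribute 0 with 0 and attribute 1 with 1.
score = matrix_to_constant(combined, {(0, 0), (1, 1)}, average)
print(combined, score)
```

Passing the aggregator as a parameter keeps the sketch neutral about which aggregation is appropriate for a given ensemble.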


3.1.3 SCHEMA MATCHING

Let the power-set $\Sigma = 2^{\mathcal{S}}$ be the set of all possible schema matchings between the schema pair $\{S, S'\}$, where a schema matching $\sigma \in \Sigma$ is a set of attribute correspondences. It is worth noting that $\sigma$ does not necessarily contain all attributes in $S$ or $S'$. Therefore, there may exist an attribute $A \in S$ such that, for all $A' \in S'$, $(A, A') \notin \sigma$. For convenience, we denote by $\bar{\sigma} = \{A \in S \mid \forall A' \in S', (A, A') \notin \sigma\} \cup \{A' \in S' \mid \forall A \in S, (A, A') \notin \sigma\}$ the set of all attributes that do not participate in a schema matching.

Let $\Gamma : \Sigma \to \{0, 1\}$ be a boolean function that captures the application-specific constraints on schema matchings, e.g., cardinality constraints and inter-attribute correspondence constraints. $\Gamma$ partitions $\Sigma$ into two sets, where the set of all valid schema matchings in $\Sigma$ is given by $\Sigma_{\Gamma} = \{\sigma \in \Sigma \mid \Gamma(\sigma) = 1\}$. $\Gamma$ is a general constraint model, where $\Gamma(\sigma) = 1$ means that the matching $\sigma$ can be accepted by a designer. $\Gamma$ has been modeled in the literature using special types of matchers called constraint enforcers [Lee et al., 2007], whose output is recorded in a binary similarity matrix. We say that $\Gamma$ is a null constraint function (basically accepting all possible matchings as valid with no use of a constraint enforcer) if for all $\sigma \in \Sigma$, $\Gamma(\sigma) = 1$.
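The definitions in this section can be exercised on a toy schema pair. The sketch below is only an illustration under stated assumptions: the attribute names, the function names, and the choice of a 1:1 cardinality constraint are hypothetical, and enumerating the full power set is feasible only for very small schemata.

```python
from itertools import chain, combinations

def all_matchings(S, S_prime):
    """Enumerate every subset of S x S', i.e., every possible schema matching."""
    pairs = [(a, b) for a in S for b in S_prime]
    subsets = chain.from_iterable(
        combinations(pairs, k) for k in range(len(pairs) + 1))
    return [frozenset(subset) for subset in subsets]

def unmatched(sigma, S, S_prime):
    """Attributes of either schema that appear in no correspondence of sigma."""
    left = {a for (a, _) in sigma}
    right = {b for (_, b) in sigma}
    return (set(S) - left) | (set(S_prime) - right)

def one_to_one(sigma):
    """Hypothetical cardinality constraint: 1 if every attribute is matched at most once, else 0."""
    left = [a for (a, _) in sigma]
    right = [b for (_, b) in sigma]
    return int(len(set(left)) == len(left) and len(set(right)) == len(right))

def null_constraint(sigma):
    """Null constraint function: accepts every matching."""
    return 1

# Hypothetical schema pair; attribute names are assumed to be distinct across schemata.
S = ["name", "phone"]
S_prime = ["fullName", "tel"]

matchings = all_matchings(S, S_prime)
valid = [sigma for sigma in matchings if one_to_one(sigma) == 1]
print(len(matchings), len(valid))
print(unmatched({("name", "fullName")}, S, S_prime))
```

With the null constraint every enumerated matching is valid, whereas the 1:1 constraint keeps only those in which no attribute participates in more than one correspondence.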
