Schema Matcher Ensembles - Uncertain Schema Matching


$$R(t) = \frac{B_t + D_t}{C_t + B_t + D_t} \tag{4.11}$$

and F-Measure

$$FM(t) = \frac{2(B_t + D_t)}{A_t + C_t + 2B_t + 2D_t} \tag{4.12}$$

which yields an error measure of

$$\varepsilon_t = 1 - FM(t) = 1 - \frac{2(B_t + D_t)}{A_t + C_t + 2B_t + 2D_t} = \frac{A_t + C_t}{A_t + C_t + 2B_t + 2D_t} \tag{4.13}$$
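As a quick sanity check on Eqs. 4.11 to 4.13, the sketch below evaluates the three measures for one set of values and confirms that 1 − FM(t) equals the closed form (A_t + C_t)/(A_t + C_t + 2B_t + 2D_t). The function names and the integer values are illustrative assumptions, not taken from the text; A_t through D_t are treated simply as non-negative numbers.

```python
from fractions import Fraction


def recall(a, b, c, d):
    """R(t) as written in Eq. 4.11: (B + D) / (C + B + D)."""
    return Fraction(b + d, c + b + d)


def f_measure(a, b, c, d):
    """FM(t) as written in Eq. 4.12: 2(B + D) / (A + C + 2B + 2D)."""
    return Fraction(2 * (b + d), a + c + 2 * b + 2 * d)


def error(a, b, c, d):
    """Error measure of Eq. 4.13, computed as 1 - FM(t)."""
    return 1 - f_measure(a, b, c, d)


def error_closed_form(a, b, c, d):
    """Right-hand side of Eq. 4.13: (A + C) / (A + C + 2B + 2D)."""
    return Fraction(a + c, a + c + 2 * b + 2 * d)


# Hypothetical values chosen only to exercise the formulas.
A, B, C, D = 3, 5, 2, 40

# Exact rational arithmetic lets the two forms be compared with ==.
assert error(A, B, C, D) == error_closed_form(A, B, C, D)
print("R(t)  =", recall(A, B, C, D))
print("FM(t) =", f_measure(A, B, C, D))
print("eps_t =", error(A, B, C, D))
```

Exact fractions are used instead of floats so that the identity can be asserted without a tolerance.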

Empirical evaluation suggests that Eq. 4.10 performs better than other error measures. We hypothesize that the reason for the difference in performance is that typically $|B_t| \ll |D_t|$, and with such imbalance, it is impossible to credit the matchers for their successful selection of true positives. To understand this last argument, consider two schemata of size $n$ with an exact matching of size $n$ (a typical example of 1:1 matching). There are then $n$ positive examples and $n^2 - n$ negative examples, a clear imbalance of positive and negative examples. Other efforts, using various weighting methods to balance the different sets, have yielded little improvement.
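To see how quickly the imbalance grows, the short sketch below counts positive and negative examples for two schemata of size n under a 1:1 exact matching, as in the argument above. The helper name `example_counts` and the chosen sizes are illustrative assumptions.

```python
def example_counts(n):
    """Positive and negative attribute pairs for two schemata of size n
    whose exact matching is 1:1 and has size n."""
    positives = n              # one correct partner per attribute
    negatives = n * n - n      # every other pair among the n * n candidates
    return positives, negatives


for n in (10, 50, 100):
    pos, neg = example_counts(n)
    # The positive share is n / n^2 = 1 / n, so it shrinks as schemata grow.
    print(f"n={n}: positives={pos}, negatives={neg}, positive share={pos / (pos + neg):.3f}")
```

The share of positive examples is 1/n, so larger schemata make the imbalance more severe rather than less.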

Example 4.2 This example is due to Gal and Sagi [2010]. Given the hypothesis space described above, and given a dataset of size 70, the SMB heuristic performs 5 iterations. First, it creates a dataset with equal weight for each mapping. In the first iteration, it picks (Composition, Dominants)³, which yields the most accurate hypothesis over the initial weight distribution ($\varepsilon_1 = 0.328 \Rightarrow \alpha_1 = 0.359$). In the second iteration, the selected hypothesis is (Precedence, Intersection) with $\varepsilon_2 = 0.411$ and $\alpha_2 = 0.180$, and in the third, (Precedence, MWBG) with $\varepsilon_3 = 0.42 \Rightarrow \alpha_3 = 0.161$. The fourth hypothesis selected is (Term and Value, Intersection), with $\varepsilon_4 = 0.46$ and $\alpha_4 = 0.080$. The fifth and final selection is (Term and Value, MWBG), with $\varepsilon_5 = 0.49 \Rightarrow \alpha_5 = 0.020$. In the sixth iteration, no hypothesis performs better than 50% error, so the training phase is terminated after 5 iterations, each with strength $\alpha_t$. The outcome classification rule is a linear combination of the five weak matchers with their strengths as coefficients. So, given a new attribute pair $(a, a')$ to be considered, each of the weak matchers contributes to the final decision such that its decision is weighted by its strength. If the final decision is positive, the given attribute pair is classified as an attribute correspondence. If not, it will be classified as incorrect.
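The strengths in Example 4.2 can be reproduced with the standard AdaBoost formula α_t = ½ ln((1 − ε_t)/ε_t). The excerpt does not state this formula, so its use here is an assumption, although it matches all five reported values to three decimals. The sketch also shows a weighted vote over the five weak matchers; encoding each weak decision as +1 or −1 is likewise an assumed convention, and the votes passed in are made up for illustration.

```python
import math


def strength(eps):
    """Standard AdaBoost strength for weighted error eps (assumed formula)."""
    return 0.5 * math.log((1.0 - eps) / eps)


# (first-line matcher, second-line matcher, error) as reported in Example 4.2.
SELECTED = [
    ("Composition", "Dominants", 0.328),
    ("Precedence", "Intersection", 0.411),
    ("Precedence", "MWBG", 0.42),
    ("Term and Value", "Intersection", 0.46),
    ("Term and Value", "MWBG", 0.49),
]

STRENGTHS = [strength(eps) for _, _, eps in SELECTED]


def classify(votes, strengths=STRENGTHS):
    """Weighted vote: votes[i] is +1 or -1 from the i-th weak matcher.
    Returns (is_correspondence, score); a positive score means 'match'."""
    score = sum(alpha * vote for alpha, vote in zip(strengths, votes))
    return score > 0, score


for (first, second, eps), alpha in zip(SELECTED, STRENGTHS):
    print(f"({first}, {second}): eps={eps}, alpha={alpha:.3f}")

# Hypothetical votes for one attribute pair: only the strongest matcher agrees.
print(classify([+1, -1, -1, -1, -1]))
# Hypothetical votes: the two strongest matchers agree.
print(classify([+1, +1, -1, -1, -1]))
```

Because the strengths fall off quickly as the error approaches 50%, the first matcher alone cannot outvote the other four, but the first two together can.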

Let $h_{max}$ be the maximum execution time of a matcher in the hypothesis space, and let $t_{max}$ be the number of iterations performed by SMB. The training time of SMB is $O(h_{max} \cdot t_{max})$. Given a new schema pair, let $n_{max}$ be the maximum number of attributes in each schema. The cost of using SMB is $O(n_{max}^2)$, the cost of generating the output matrix.
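The quadratic cost of applying a trained ensemble comes from visiting every attribute pair once. The sketch below makes that explicit by counting calls to a decision function; `output_matrix`, the toy decision rule (case-insensitive name equality), and the attribute names are hypothetical stand-ins, not part of SMB.

```python
def output_matrix(source_attrs, target_attrs, decide):
    """Apply a pairwise decision rule to every attribute pair.
    Makes len(source_attrs) * len(target_attrs) calls to decide."""
    return [[decide(a, b) for b in target_attrs] for a in source_attrs]


counter = {"calls": 0}


def toy_decide(a, b):
    """Hypothetical stand-in for the ensemble's decision on one pair."""
    counter["calls"] += 1
    return a.lower() == b.lower()


source = ["Name", "City", "Zip"]
target = ["zip", "name", "Phone"]
matrix = output_matrix(source, target, toy_decide)

for row in matrix:
    print(row)
print("pairs evaluated:", counter["calls"])
```

With at most n_max attributes on each side, the number of evaluated pairs is bounded by n_max squared, which is the bound quoted above.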

3 Descriptions of all matchers in this example are given in Section 3.1.2.
