$$R(t) = \frac{B_t + D_t}{C_t + B_t + D_t} \tag{4.11}$$
and F-Measure
$$FM(t) = \frac{2(B_t + D_t)}{A_t + C_t + 2B_t + 2D_t} \tag{4.12}$$
which yields an error measure of
$$\varepsilon_t = 1 - FM(t) = 1 - \frac{2(B_t + D_t)}{A_t + C_t + 2B_t + 2D_t} = \frac{A_t + C_t}{A_t + C_t + 2B_t + 2D_t} \tag{4.13}$$
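As a quick numerical check of Eqs. 4.11 to 4.13, the following Python sketch evaluates the three measures and confirms that $1 - FM(t)$ agrees with the closed form of Eq. 4.13. It assumes that $A_t$, $B_t$, $C_t$, and $D_t$ can be treated as plain nonnegative totals; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def recall(b, c, d):
    """Eq. 4.11: R(t) = (B_t + D_t) / (C_t + B_t + D_t)."""
    return (b + d) / (c + b + d)


def f_measure(a, b, c, d):
    """Eq. 4.12: FM(t) = 2(B_t + D_t) / (A_t + C_t + 2 B_t + 2 D_t)."""
    return 2 * (b + d) / (a + c + 2 * b + 2 * d)


def f_measure_error(a, b, c, d):
    """Eq. 4.13, closed form: (A_t + C_t) / (A_t + C_t + 2 B_t + 2 D_t)."""
    return (a + c) / (a + c + 2 * b + 2 * d)


# Hypothetical totals, not taken from the text.
a, b, c, d = 1.0, 2.0, 1.0, 2.0

# 1 - FM(t) and the closed form of Eq. 4.13 should coincide.
assert abs((1 - f_measure(a, b, c, d)) - f_measure_error(a, b, c, d)) < 1e-12
```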
Empirical evaluation suggests that Eq. 4.10 performs better than other error measures. We hypothesize that the reason for the difference in performance is that typically $|B_t| \ll |D_t|$, and with such imbalance, it is impossible to credit the matchers for their successful selection of true positives. To understand this last argument, consider two schemata of size $n$ with an exact matching of size $n$ (a typical example of 1 : 1 matching). There are then $n$ positive examples and $n^2 - n$ negative examples, a clear imbalance of positive and negative examples. Other efforts, using various weighting methods to balance the different sets, have yielded little improvement.
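The size of the imbalance described above can be made concrete with a small Python sketch. It counts the positive and negative examples for two schemata of size $n$ with a 1 : 1 exact matching, and shows that a matcher rejecting every pair would still be correct on all $n^2 - n$ negative examples. Plain unweighted accuracy is used here only as an illustrative measure (an assumption; it is not necessarily one of the measures compared in the text), and the function names are hypothetical.

```python
def class_counts(n):
    """Positive and negative examples for two schemata of size n (1:1 exact matching)."""
    positives = n          # one correct correspondence per attribute
    negatives = n * n - n  # every other attribute pair
    return positives, negatives


def all_negative_accuracy(n):
    """Unweighted accuracy of a matcher that rejects every attribute pair."""
    positives, negatives = class_counts(n)
    return negatives / (positives + negatives)


for n in (10, 100):
    print(n, class_counts(n), all_negative_accuracy(n))
```

Under these assumptions, the share of negative examples is $1 - 1/n$, so it approaches 1 as the schemata grow.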
Example 4.2. This example is due to Gal and Sagi [2010]. Given the hypothesis space $\mathcal{H}$ as described above, and given a dataset of size 70, the SMB heuristic performs 5 iterations. First, it creates a dataset with equal weight for each mapping. In the first iteration, it picks (Composition, Dominants)³, which yields the most accurate hypothesis over the initial weight distribution ($\varepsilon_1 = 0.328$, $\alpha_1 = 0.359$). In the second iteration, the selected hypothesis is (Precedence, Intersection) with $\varepsilon_2 = 0.411$ and $\alpha_2 = 0.180$, and in the third, (Precedence, MWBG) with $\varepsilon_3 = 0.42$ and $\alpha_3 = 0.161$. The fourth hypothesis selected is (Term and Value, Intersection), with $\varepsilon_4 = 0.46$ and $\alpha_4 = 0.080$. The fifth and final selection is (Term and Value, MWBG), with $\varepsilon_5 = 0.49$ and $\alpha_5 = 0.020$. In the sixth iteration, no hypothesis performs better than 50% error, so the training phase is terminated after 5 iterations, each with strength $\alpha_t$. The outcome classification rule is a linear combination of the five weak matchers with their strengths as coefficients. So, given a new attribute pair $(a, a')$ to be considered, each of the weak matchers contributes to the final decision such that its decision is weighted by its strength. If the final decision is positive, the given attribute pair is classified as an attribute correspondence. If not, it will be classified as incorrect.
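The strengths reported in Example 4.2 are numerically consistent with the standard AdaBoost rule $\alpha_t = \frac{1}{2}\ln\frac{1 - \varepsilon_t}{\varepsilon_t}$. This section does not state that rule, so the Python sketch below should be read as an assumption that happens to reproduce the reported values to three decimals, not as a confirmed description of SMB. The names `strength` and `should_stop` are hypothetical.

```python
import math


def strength(error):
    """Assumed AdaBoost strength: alpha = 0.5 * ln((1 - error) / error)."""
    if not 0.0 < error < 0.5:
        raise ValueError("error must lie strictly between 0 and 0.5")
    return 0.5 * math.log((1.0 - error) / error)


def should_stop(best_error):
    """Training ends when no hypothesis beats 50% error, as in the example."""
    return best_error >= 0.5


# (error, strength) pairs as reported in Example 4.2.
reported = [(0.328, 0.359), (0.411, 0.180), (0.42, 0.161),
            (0.46, 0.080), (0.49, 0.020)]

for eps, alpha in reported:
    # The assumed rule matches each reported strength within rounding.
    assert abs(strength(eps) - alpha) < 1e-3
```

Under this rule the strength shrinks toward zero as the error approaches 50%, which fits the decreasing sequence of $\alpha_t$ values in the example.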
Let $h_{\max}$ be the maximum execution time of a matcher in $\mathcal{H}$ and $t_{\max}$ be the number of iterations performed by SMB. The training time of SMB is $O(h_{\max} \cdot t_{\max})$. Given a new schema pair, let $n_{\max}$ be the maximum number of attributes in each schema. The cost of using SMB is $O(n_{\max}^2)$, the cost of generating the output matrix.
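The quadratic cost of producing the output matrix can be illustrated by applying the weighted classification rule of Example 4.2 to every attribute pair. The Python sketch below makes several assumptions that the text does not specify: each weak matcher's decision is encoded as +1 (match) or -1 (no match), the final decision is the sign of the strength-weighted sum, and the weak matchers' decisions are already available as matrices. The function name `combine` is hypothetical.

```python
def combine(decisions, strengths):
    """Weighted vote over all attribute pairs.

    decisions: one matrix per weak matcher, entries +1 (match) or -1 (no match).
    strengths: one strength alpha_t per weak matcher.
    Returns a boolean matrix; True marks an attribute correspondence.
    """
    rows = len(decisions[0])
    cols = len(decisions[0][0])
    output = [[False] * cols for _ in range(rows)]
    # Each attribute pair is visited once; with a fixed number of weak
    # matchers the work is proportional to rows * cols.
    for i in range(rows):
        for j in range(cols):
            score = sum(alpha * h[i][j] for alpha, h in zip(strengths, decisions))
            output[i][j] = score > 0
    return output


# Strengths from Example 4.2; the decision matrices are made up.
strengths = [0.359, 0.180, 0.161, 0.080, 0.020]
reject_all = [[-1, -1], [-1, -1]]
decisions = [[[1, 1], [-1, -1]], [[-1, 1], [-1, 1]],
             reject_all, reject_all, reject_all]
result = combine(decisions, strengths)
```

In this made-up input, only the pair on which the two strongest matchers agree receives a positive weighted sum.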
3 Descriptions of all matchers in this example are given in Section 3.1.2.
 