Schema Matcher Ensembles - Uncertain Schema Matching

Databases Reference

In-Depth Information

boosting algorithms ( e.g. ,[ Schapire , 1990 ]). This class of algorithms can strengthen weak classifiers

to achieve arbitrarily high accuracy; they have been shown to be effective in the construction of

successful classifiers. Given a set of weak classifiers, the algorithm iterates over them while re-

weighting the importance of elements in the training set.

Many boosting algorithms have been proposed in the literature. Following Gal and Sagi

[ 2010 ], we describe a matcher based on the AdaBoost algorithm [ Freund and Schapire , 1999 ].

AdaBoost, presented in Section 4.3.1 , is the most popular and historically most significant boost-

ing algorithm. SMB , the boosting heuristic for schema matching, is described first, followed by a

discussion of its role in selecting matchers for ensembles.

4.3.1

ADABOOST

Algorithm 1 Boosting

1: Input: S

(x 1 ,y 1 ), …, (x m ,y m )

}

, and a space hypotheses

∀

≤ i ≤ m, x i

∀

≤ i ≤ m, y i {−

1 , +

}

2: /*

, and

3: /* initialization: */

4: for all 1

≤ i ≤ m do

5: D 1 (i) = 1 /m

6: end for

7: t = 1

8: repeat

/* training phase: */

Find the classifier h t : X →{−

1 ,

}

, h t

that minimizes the error with respect to the

10:

distribution D t : h t = arg h j min ε j .

if ε t ≤

0 . 5 then

11:

2 ln 1 − ε t

Choose α t R . α t =

12:

ε t

D t (i) exp ( − α t y i h t (x i ))

Z t

Update D t + 1 (i)

where Z t is a normalization factor

13:

t = t +

14:

15: end if

16: until t = T or ε t > 0 . 5

17: /* upon arrival of a new instance: */

18: Output the final classifier: H(x) = sign( min (t,T )

k = 1

α k h k (x))

The AdaBoost algorithm template is given in Algorithm 1 . The input to a boosting algorithm

is a set of m examples where each example (x i ,y i ) is a pair of an instance x i and the classification

of the instance mapping, y i . y i typically (though not always) accepts a binary value in

{−

1 ,

1 }

where

1 stands for a correct classification. Therefore,

the algorithm is aimed at binary classifications. The last input element is a hypothesis space

−

1 stands for an incorrect classification and

, a set

of weak classifiers.

Uncertain Schema Matching

Search WWH ::

Custom Search

Home