Databases Reference
In-Depth Information
boosting
algorithms (
e.g.
,[
Schapire
,
1990
]). This class of algorithms can strengthen weak classifiers
to achieve arbitrarily high accuracy; they have been shown to be effective in the construction of
successful classifiers. Given a set of weak classifiers, the algorithm iterates over them while re-
weighting the importance of elements in the training set.
Many boosting algorithms have been proposed in the literature. Following
Gal and Sagi
[
2010
], we describe a matcher based on the AdaBoost algorithm [
Freund and Schapire
,
1999
].
AdaBoost, presented in Section
4.3.1
, is the most popular and historically most significant boost-
ing algorithm.
SMB
, the boosting heuristic for schema matching, is described first, followed by a
discussion of its role in selecting matchers for ensembles.
4.3.1
ADABOOST
Algorithm 1
Boosting
1:
Input:
S
={
(x
1
,y
1
),
…,
(x
m
,y
m
)
}
, and a space hypotheses
H
.
∀
≤
i
≤
m, x
i
X
∀
≤
i
≤
m, y
i
{−
1
,
+
}
2:
/*
1
, and
1
1
*/
3:
/* initialization: */
4:
for all
1
≤
i
≤
m
do
5:
D
1
(i)
=
1
/m
6:
end for
7:
t
=
1
8:
repeat
9:
/* training phase: */
Find the classifier
h
t
:
X
→{−
1
,
+
1
}
,
h
t
H
that minimizes the error with respect to the
10:
distribution
D
t
:
h
t
=
arg
h
j
min
ε
j
.
if
ε
t
≤
0
.
5
then
11:
2
ln
1
−
ε
t
Choose
α
t
R
.
α
t
=
12:
ε
t
D
t
(i)
exp
(
−
α
t
y
i
h
t
(x
i
))
Z
t
Update
D
t
+
1
(i)
=
where
Z
t
is a normalization factor
13:
t
=
t
+
1
14:
15:
end if
16:
until
t
=
T
or
ε
t
>
0
.
5
17:
/* upon arrival of a new instance: */
18:
Output the final classifier:
H(x)
=
sign(
min
(t,T )
k
=
1
α
k
h
k
(x))
The AdaBoost algorithm template is given in Algorithm
1
. The input to a boosting algorithm
is a
set
of
m
examples where each example
(x
i
,y
i
)
is a pair of an instance
x
i
and the classification
of the instance mapping,
y
i
.
y
i
typically (though not always) accepts a binary value in
{−
1
,
+
1
}
,
where
1 stands for a correct classification. Therefore,
the algorithm is aimed at binary classifications. The last input element is a hypothesis space
−
1 stands for an incorrect classification and
+
H
, a set
of weak classifiers.