Table 4.6 McNemar's test: contingency table.

Number of examples misclassified by both classifiers ($n_{00}$) | Number of examples misclassified by $\hat{f}_A$ but not by $\hat{f}_B$ ($n_{01}$)
Number of examples misclassified by $\hat{f}_B$ but not by $\hat{f}_A$ ($n_{10}$) | Number of examples misclassified neither by $\hat{f}_A$ nor by $\hat{f}_B$ ($n_{11}$)
Table 4.7 Expected counts under $H_0$.

$n_{00}$ | $(n_{01} + n_{10})/2$
$(n_{01} + n_{10})/2$ | $n_{11}$
training set and the result is two classifiers. These classifiers are tested on T, and for each example $x \in T$ we record how it was classified. Thus, the contingency table presented in Table 4.6 is constructed.
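The construction of the contingency table can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `mcnemar_counts` and the argument names are hypothetical, and it assumes the true labels and both classifiers' predictions on T are available as equal-length sequences.

```python
def mcnemar_counts(y_true, pred_a, pred_b):
    """Return (n00, n01, n10, n11) laid out as in Table 4.6.

    n00: misclassified by both classifiers
    n01: misclassified by A but not by B
    n10: misclassified by B but not by A
    n11: misclassified by neither
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all sequences must have the same length")
    n00 = n01 = n10 = n11 = 0
    for y, a, b in zip(y_true, pred_a, pred_b):
        a_wrong = a != y
        b_wrong = b != y
        if a_wrong and b_wrong:
            n00 += 1
        elif a_wrong:
            n01 += 1
        elif b_wrong:
            n10 += 1
        else:
            n11 += 1
    return n00, n01, n10, n11


# Toy data (made up for illustration only).
y_true = [1, 0, 1, 1, 0, 0]
pred_a = [0, 0, 1, 0, 0, 1]
pred_b = [0, 1, 1, 1, 0, 1]
counts = mcnemar_counts(y_true, pred_a, pred_b)
```

Only the two off-diagonal counts, n01 and n10, carry information about a difference between the classifiers; the examples on which both agree do not distinguish them.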
The two inducers should have the same error rate under the null hypothesis $H_0$. McNemar's test is based on a $\chi^2$ test for goodness-of-fit that compares the distribution of counts expected under the null hypothesis to the observed counts. The expected counts under $H_0$ are presented in Table 4.7.
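As a sketch of how the goodness-of-fit comparison works out on these counts: write $m = (n_{01} + n_{10})/2$ for the expected off-diagonal count (the symbol $m$ is introduced here for convenience and is not in the source). The diagonal cells contribute nothing, because their expected counts equal the observed ones, so the usual Pearson sum of (observed − expected)² / expected reduces to a single ratio. This is the uncorrected form, before any continuity correction.

```latex
\begin{aligned}
\sum_{\text{cells}} \frac{(O - E)^2}{E}
  &= \frac{(n_{01} - m)^2}{m} + \frac{(n_{10} - m)^2}{m},
     \qquad m = \frac{n_{01} + n_{10}}{2} \\
  &= \frac{2}{m}\left(\frac{n_{01} - n_{10}}{2}\right)^2
   = \frac{(n_{01} - n_{10})^2}{n_{01} + n_{10}}
\end{aligned}
```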
The following statistic, s, is distributed as $\chi^2$ with 1 degree of freedom. It incorporates a “continuity correction” term (of −1 in the numerator) to account for the fact that the statistic is discrete while the $\chi^2$ distribution is continuous:

$$ s = \frac{(|n_{10} - n_{01}| - 1)^2}{n_{10} + n_{01}} \qquad (4.23) $$
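Equation (4.23) translates directly into code. The sketch below is illustrative only: the function name `mcnemar_statistic` is hypothetical, and raising an error when there are no discordant examples is a choice made here, since the source does not say how to treat that case.

```python
def mcnemar_statistic(n01, n10):
    """Continuity-corrected McNemar statistic, as in Eq. (4.23)."""
    discordant = n01 + n10
    if discordant == 0:
        # Assumption: the statistic is undefined when the classifiers
        # never disagree in correctness; the source does not cover this case.
        raise ValueError("statistic is undefined when n01 + n10 == 0")
    return (abs(n10 - n01) - 1) ** 2 / discordant


# Made-up counts: A alone errs on 10 examples, B alone errs on 25.
s = mcnemar_statistic(10, 25)  # (|25 - 10| - 1)^2 / 35 = 196 / 35
```

Note that the statistic depends only on the discordant counts, so it is symmetric in n01 and n10.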
According to the probabilistic theory [Athanasopoulos, 1991], if the null hypothesis is correct, the probability that the value of the statistic, s, is greater than $\chi^2_{1,0.95}$ is less than 0.05, i.e. $P(|s| > \chi^2_{1,0.95}) < 0.05$. Then, to compare the inducers A and B, the induced classifiers $\hat{f}_A$ and $\hat{f}_B$ are tested on T and the value of s is estimated as described above. Then if $|s| > \chi^2_{1,0.95}$, the null hypothesis could be rejected in favor of the hypothesis that the two inducers have different performance when trained on the particular training set R.
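The decision rule can be sketched as follows, using only the Python standard library. The names `mcnemar_decision` and `CHI2_1_095` are hypothetical. Two facts are assumed beyond the text: the tabulated value of $\chi^2_{1,0.95}$ is approximately 3.841, and for one degree of freedom the upper-tail probability of the $\chi^2$ distribution equals erfc(sqrt(s/2)), because a $\chi^2$ variable with one degree of freedom is the square of a standard normal variable.

```python
import math

# Approximate 0.95 quantile of the chi-square distribution with 1 degree
# of freedom (standard tabulated value; not stated in the source text).
CHI2_1_095 = 3.841


def mcnemar_decision(s, critical=CHI2_1_095):
    """Return (reject_h0, p_value) for a McNemar statistic s >= 0."""
    if s < 0:
        raise ValueError("the statistic cannot be negative")
    # Upper-tail probability of chi-square with 1 dof: P(Z^2 > s) for Z ~ N(0, 1).
    p_value = math.erfc(math.sqrt(s / 2.0))
    return s > critical, p_value


# Example with a made-up statistic value of 5.6.
reject, p_value = mcnemar_decision(5.6)
```

Comparing s with the critical value and comparing the p-value with 0.05 are two views of the same rule; the p-value is returned here only because it is often reported alongside the decision.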
The shortcomings of this test are:
(1) It does not directly measure variability due to the choice of the training set or the internal randomness of the inducer. The inducers are compared using a single training set R. Thus McNemar's test should only be applied if we consider that these sources of variability are small.