There is another element of ritual for which some wariness is appropriate:
the choice of a suitable statistical test. One of the most commonly performed
tests in work on search based algorithms in general (though not necessarily SBSE
in particular) is the well-known t test. Almost all statistical packages support
it and it is often available at the touch of a button. Unfortunately, the t test
makes assumptions about the distribution of the data. These assumptions may
not be borne out in practice, thereby increasing the chance of a Type I error. In
some senses a Type I error is worse than a Type II error, because it may lead to
the publication of false claims, whereas a Type II error will most likely lead only to
researcher disappointment at the lack of evidence to support publishable results.
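To make the risk concrete, the following is a minimal sketch (assuming NumPy and SciPy are available; all data are simulated) that estimates the Type I error rate of the pooled t test when one of its assumptions, equal variances in this illustration, does not hold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
trials, alpha = 5000, 0.05

false_positives = 0
for _ in range(trials):
    # Both groups share the same mean, so the null hypothesis is true;
    # the equal-variance assumption of the pooled t test is violated.
    a = rng.normal(loc=0.0, scale=3.0, size=10)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)  # pooled (equal-variance) t test
    if p < alpha:
        false_positives += 1

print(f"Observed Type I error rate: {false_positives / trials:.3f}")
# Expect a rate well above the nominal 0.05.
```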
To address this potential problem with parametric inferential statistics, SBSE
researchers often use non-parametric statistical tests. Non-parametric tests make
fewer assumptions about the distribution of the data. As such, these tests are weaker
(they have less power) and may lead to the false acceptance of the null hypothesis
for the same sample size (a Type II error), when used in place of a more powerful
parametric test that is able to reject the null hypothesis. However, since the
parametric tests make assumptions about the distribution, should these assumptions
prove to be false, then the rejection of the null hypothesis by a parametric test may
be an artefact of the false assumptions; a form of Type I error.
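The following sketch illustrates the trade-off, using the Mann-Whitney U test as a typical non-parametric alternative to the t test; the fitness values here are purely hypothetical and simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
# Hypothetical fitness values from 30 runs of each algorithm,
# drawn from a skewed (log-normal) distribution.
algo_a = rng.lognormal(mean=0.5, sigma=1.0, size=30)
algo_b = rng.lognormal(mean=0.0, sigma=1.0, size=30)

_, p_t = stats.ttest_ind(algo_a, algo_b)                              # parametric
_, p_u = stats.mannwhitneyu(algo_a, algo_b, alternative='two-sided')  # non-parametric

print(f"t test p-value:         {p_t:.4f}")
print(f"Mann-Whitney U p-value: {p_u:.4f}")
```

On skewed data such as these, the non-parametric test's conclusion rests on fewer assumptions, though on genuinely normal data it would need a larger sample to achieve the same power.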
It is important to remember that all inferential statistical techniques are
founded on probability theory. To the traditional computer scientist, particularly
those raised on an intellectual diet consisting exclusively of formal methods and
discrete mathematics, this reliance on probability may be as unsettling as quantum
mechanics was to the traditional world of physics. However, to engineers,
reliance on a confidence level is little more than acceptance of a certain
'tolerance', and is quite natural and acceptable.
This appreciation of the probability-theoretic foundations of inferential statistics,
rather than a merely ritualistic application of 'prescribed tests', is important
if the researcher is to avoid mistakes. For example, armed with a non-parametric
test and a confidence level of 95%, the researcher may embark on a misguided
'fishing expedition' to find a variant of Algorithm A that outperforms Algorithm
B. Suppose 5 independent variants of Algorithm A are experimented with and,
on each occasion, a comparison is made with Algorithm B using an inferential
statistical test. If variant 3 produces a p-value of 0.05 while the others do not,
it would be a mistake to conclude that, at the 95% confidence level, Algorithm A
(variant 3) is better than Algorithm B.
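A back-of-the-envelope calculation (plain Python, with the tests assumed independent for simplicity) shows why this conclusion would be unsound:

```python
# Probability of at least one spurious 'significant' result across
# 5 independent tests at alpha = 0.05, when every null hypothesis is true.
alpha, k = 0.05, 5
family_wise_error = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {family_wise_error:.3f}")  # ~0.226
```

The chance of at least one apparently significant result by luck alone is roughly 23%, far above the 5% the researcher believes they are tolerating.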
Rather, we would have to find that Algorithm A variant 3 had a p-value
lower than 0.05/5 = 0.01; by repeating the same test 5 times, we must lower the
significance threshold for each individual test from 0.05 to 0.01 to retain the same
overall level of confidence. This is known as a 'Bonferroni correction'. To see why
it is necessary, suppose we have 20 variants of Algorithm A. What would be the
expected likelihood that one of these would, by chance, have a p-value equal to or
lower than 0.05 in a world where none of the variants is, in fact, any different
from Algorithm B? If we repeat a statistical test sufficiently many times without
a correction to the significance threshold, we shall eventually, purely by chance,
obtain an apparently significant result.
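Under the same independence assumption as before, a short calculation answers the question for 20 variants and shows the effect of the correction:

```python
alpha, k = 0.05, 20
# Expected number of spuriously significant results without correction.
print(f"Expected false positives: {alpha * k:.1f}")                   # 1.0
print(f"P(at least one), uncorrected: {1 - (1 - alpha) ** k:.3f}")    # ~0.642
# Bonferroni: divide the per-test threshold by the number of tests.
corrected = alpha / k
print(f"Per-test threshold after correction: {corrected:.4f}")        # 0.0025
print(f"P(at least one), corrected: {1 - (1 - corrected) ** k:.3f}")  # ~0.049
```

Without the correction we should expect one spuriously 'significant' variant on average, and a 64% chance of at least one; with the corrected per-test threshold of 0.0025, the family-wise error rate returns to approximately the intended 5%.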