There is another element of ritual for which some wariness is appropriate:
the choice of a suitable statistical test. One of the most commonly performed
tests in work on search based algorithms in general (though not necessarily SBSE
in particular) is the well-known t test. Almost all statistical packages support
it and it is often available at the touch of a button. Unfortunately, the t test
makes assumptions about the distribution of the data. These assumptions may
not be borne out in practice, thereby increasing the chance of a Type I error. In
some senses a Type I error is worse than a Type II error, because it may lead to
the publication of false claims, whereas a Type II error will most likely lead only to
researcher disappointment at the lack of evidence to support publishable results.
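To make the risk concrete, the following is a minimal sketch (assuming NumPy and SciPy are available; all data are simulated) that estimates the Type I error rate of the pooled t test when one of its assumptions, equal variances in this illustration, does not hold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
trials, alpha = 5000, 0.05

false_positives = 0
for _ in range(trials):
    # Both groups share the same mean, so the null hypothesis is true;
    # the equal-variance assumption of the pooled t test is violated.
    a = rng.normal(loc=0.0, scale=3.0, size=10)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)  # pooled (equal-variance) t test
    if p < alpha:
        false_positives += 1

print(f"Observed Type I error rate: {false_positives / trials:.3f}")
# Expect a rate well above the nominal 0.05.
```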
To address this potential problem with parametric inferential statistics, SBSE
researchers often use non-parametric statistical tests. Non-parametric tests make
fewer assumptions about the distribution of the data. As such, these tests are weaker
(they have less power) and may lead to the false acceptance of the null hypothesis
for the same sample size (a Type II error), when used in place of a more powerful
parametric test that is able to reject the null hypothesis. However, since the
parametric tests make assumptions about the distribution, should these assumptions
prove to be false, then the rejection of the null hypothesis by a parametric test may
be an artefact of the false assumptions; a form of Type I error.
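The following sketch illustrates the trade-off, using the Mann-Whitney U test as a typical non-parametric alternative to the t test; the fitness values here are purely hypothetical and simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
# Hypothetical fitness values from 30 runs of each algorithm,
# drawn from a skewed (log-normal) distribution.
algo_a = rng.lognormal(mean=0.5, sigma=1.0, size=30)
algo_b = rng.lognormal(mean=0.0, sigma=1.0, size=30)

_, p_t = stats.ttest_ind(algo_a, algo_b)                              # parametric
_, p_u = stats.mannwhitneyu(algo_a, algo_b, alternative='two-sided')  # non-parametric

print(f"t test p-value:         {p_t:.4f}")
print(f"Mann-Whitney U p-value: {p_u:.4f}")
```

On skewed data such as these, the non-parametric test's conclusion rests on fewer assumptions, though on genuinely normal data it would need a larger sample to achieve the same power.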
It is important to remember that all inferential statistical techniques are
founded on probability theory. To the traditional computer scientist, particularly
those raised on an intellectual diet consisting exclusively of formal methods and
discrete mathematics, this reliance on probability may be as unsettling as quantum
mechanics was to the traditional world of physics. However, to engineers,
reliance on a confidence level is little more than acceptance of a certain
'tolerance', and is quite natural and acceptable.
This appreciation of the probability-theoretic foundations of inferential statistics,
rather than a merely ritualistic application of 'prescribed tests', is important
if the researcher is to avoid mistakes. For example, armed with a non-parametric
test and a confidence level of 95%, the researcher may embark on a misguided
'fishing expedition' to find a variant of Algorithm A that outperforms Algorithm
B. Suppose 5 independent variants of Algorithm A are experimented with and,
on each occasion, a comparison is made with Algorithm B using an inferential
statistical test. If variant 3 produces a p-value of 0.05 while the others do not,
it would be a mistake to conclude that, at the 95% confidence level, Algorithm A
(variant 3) is better than Algorithm B.
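A back-of-the-envelope calculation (plain Python, with the tests assumed independent for simplicity) shows why this conclusion would be unsound:

```python
# Probability of at least one spurious 'significant' result across
# 5 independent tests at alpha = 0.05, when every null hypothesis is true.
alpha, k = 0.05, 5
family_wise_error = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {family_wise_error:.3f}")  # ~0.226
```

The chance of at least one apparently significant result by luck alone is roughly 23%, far above the 5% the researcher believes they are tolerating.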
Rather, we would have to find that Algorithm A variant 3 had a p-value
lower than 0.05/5 = 0.01; by repeating the same test 5 times, we must lower the
significance threshold for each individual test from 0.05 to 0.01 to retain the same
overall level of confidence. This is known as a 'Bonferroni correction'. To see why
it is necessary, suppose we have 20 variants of Algorithm A. What would be the
expected likelihood that one of these would, by chance, have a p-value equal to or
lower than 0.05 in a world where none of the variants is, in fact, any different
from Algorithm B? If we repeat a statistical test sufficiently many times without
a correction to the significance threshold, we shall eventually, purely by chance,
obtain an apparently significant result.
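Under the same independence assumption as before, a short calculation answers the question for 20 variants and shows the effect of the correction:

```python
alpha, k = 0.05, 20
# Expected number of spuriously significant results without correction.
print(f"Expected false positives: {alpha * k:.1f}")                   # 1.0
print(f"P(at least one), uncorrected: {1 - (1 - alpha) ** k:.3f}")    # ~0.642
# Bonferroni: divide the per-test threshold by the number of tests.
corrected = alpha / k
print(f"Per-test threshold after correction: {corrected:.4f}")        # 0.0025
print(f"P(at least one), corrected: {1 - (1 - corrected) ** k:.3f}")  # ~0.049
```

Without the correction we should expect one spuriously 'significant' variant on average, and a 64% chance of at least one; with the corrected per-test threshold of 0.0025, the family-wise error rate returns to approximately the intended 5%.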