to the stochastic nature of the algorithm and so we are not in a position to say
exactly what the value obtained will be. Rather, we can give a range of values.
However, it is almost always impractical to perform all possible runs and so
we have to sample. Our '95% confidence claim' is that we are 95% confident that
the evidence provided by our sample allows us to infer a conclusion about the
algorithm's performance on the whole population. This is why this branch of
statistics is referred to as 'inferential statistics'; we infer properties of a whole
population based on a sample.
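As a concrete illustration, the sketch below estimates a 95% confidence interval for the population mean from a sample of 30 runs. The run data here are made up (values drawn from a normal distribution stand in for measured results); only the procedure is the point.

```python
# Minimal sketch: infer a range of values (a 95% confidence interval) for the
# mean result of a stochastic algorithm from a sample of its runs.
# The data below are synthetic stand-ins for 30 measured runs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=100.0, scale=15.0, size=30)  # hypothetical run results

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
# t-interval with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print(f"sample mean = {mean:.2f}, 95% confidence interval = ({low:.2f}, {high:.2f})")
```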
Unfortunately, a great deal of 'ritualistic' behaviour has grown up around the
experimental sciences, resulting in part from an inadequate understanding of the
underlying statistics. One aspect of this ritual is found in the choice of a suitable
confidence level. If we are comparing some new SBSE approach to the state of
the art, then we are asking a question as to whether the new approach is worthy
of consideration. In such a situation we may be happy with a 1 in 10 chance of
a Type I error (and could set the confidence level, accordingly, to be 90%). The
consequences of considering a move from the status quo may not be so great.
However, if we are considering whether to use a potentially fatal drug on a
patient who may otherwise survive, we might want a much higher confidence
that the drug would, indeed, improve the health of the patient over the status
quo (no treatment). For this reason it is important to think about what level of
confidence is suitable for the problem in hand.
The statistical test we perform will result in a p-value. The p-value captures
the risk of a Type I error. That is, we notice that a sample of runs produces a
higher mean result for a measurement of interest for Algorithm A than for
Algorithm B. We wish to reject the so-called 'null hypothesis': the hypothesis
that the population of all executions of Algorithm A is no different to that of
Algorithm B. To do this we perform an inferential statistical test. If all the
assumptions of the test are met and the sample of runs we have is unbiased,
then the p-value we obtain indicates how likely we would be to see a difference
at least as large as that in our sample if the populations of runs of Algorithm A
and Algorithm B were identical. For instance, a p-value equal to or lower than
0.05 indicates that we have satisfied the traditional (and somewhat ritualistic)
95% confidence level test: by rejecting the null hypothesis only when p is at or
below 0.05, we keep the chance of committing a Type I error to at most 5%.
This raises the question of how large a sample we should choose. The sample
size is related to the statistical power of our experiment. If we have too small a
sample then we may obtain high p-values and incorrectly conclude that there is
no significant difference between the two algorithms we are considering. This is
a so-called Type II error; we incorrectly accept the null hypothesis when it is,
in fact, false. In our case it would mean that we would incorrectly believe Algo-
rithm A to be no better than Algorithm B. More precisely, we would conclude,
correctly, that we have no evidence to claim that Algorithm A is significantly
better than Algorithm B at the chosen confidence level. However, had we chosen
a larger sample, we may have had just such evidence. In general, all else being
equal, the larger the sample we choose the less likely we are to commit a Type
II error. This is why researchers prefer larger sample sizes where this is feasible.
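The effect of sample size on the Type II error rate can be seen by simulation. In the sketch below the two algorithms genuinely differ by construction of the synthetic data, and we count how often the test fails to detect that difference at the 95% level; the effect size, sample sizes and number of trials are illustrative assumptions.

```python
# Minimal sketch: estimate the Type II error rate for several sample sizes by
# repeated simulation. Algorithm A really is better on average by construction,
# so every non-rejection of the null hypothesis is a Type II error.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=3)
alpha = 0.05   # 95% confidence level
trials = 500   # number of simulated experiments per sample size

for n in (10, 30, 100):
    misses = 0
    for _ in range(trials):
        runs_a = rng.normal(105.0, 15.0, n)  # A's population mean is truly higher
        runs_b = rng.normal(100.0, 15.0, n)
        _, p = mannwhitneyu(runs_a, runs_b, alternative="greater")
        if p > alpha:  # failed to detect the true difference
            misses += 1
    print(f"n = {n:3d}: estimated Type II error rate = {misses / trials:.2f}")
```

All else being equal, the estimated Type II error rate shrinks as n grows, which is exactly the argument for preferring larger samples where feasible.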