to the stochastic nature of the algorithm and so we are not in a position to say
exactly what the value obtained will be. Rather, we can give a range of values.
However, it is almost always impractical to perform all possible runs and so
we have to sample. Our '95% confidence claim' is that we are 95% confident that
the evidence provided by our sample allows us to infer a conclusion about the
algorithm's performance on the whole population. This is why this branch of
statistics is referred to as 'inferential statistics'; we infer properties of a whole
population based on a sample.
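As a concrete illustration, the sketch below estimates a 95% confidence interval for the population mean from a sample of 30 runs. The run data here are made up (values drawn from a normal distribution stand in for measured results); only the procedure is the point.

```python
# Minimal sketch: infer a range of values (a 95% confidence interval) for the
# mean result of a stochastic algorithm from a sample of its runs.
# The data below are synthetic stand-ins for 30 measured runs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=100.0, scale=15.0, size=30)  # hypothetical run results

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
# t-interval with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)

print(f"sample mean = {mean:.2f}, 95% confidence interval = ({low:.2f}, {high:.2f})")
```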
Unfortunately, a great deal of 'ritualistic' behaviour has grown up around the
experimental sciences, resulting in part from an inadequate understanding of the
underlying statistics. One aspect of this ritual is found in the choice of a suitable
confidence level. If we are comparing some new SBSE approach to the state of
the art, then we are asking a question as to whether the new approach is worthy
of consideration. In such a situation we may be happy with a 1 in 10 chance of
a Type I error (and could set the confidence level, accordingly, to be 90%). The
consequences of considering a move from the status quo may not be so great.
However, if we are considering whether to use a potentially fatal drug on a
patient who may otherwise survive, we might want a much higher confidence
that the drug would, indeed, improve the health of the patient over the status
quo (no treatment). For this reason it is important to think about what level of
confidence is suitable for the problem in hand.
The statistical test we perform will result in a p-value. The p-value captures
the risk of a Type I error. That is, we notice that a sample of runs produces a
higher mean result for a measurement of interest for Algorithm A than for
Algorithm B. We wish to reject the so-called 'null hypothesis': the hypothesis
that the population of all executions of Algorithm A is no different to that of
Algorithm B. To do this we perform an inferential statistical test. If all the
assumptions of the test are met and the sample of runs we have is unbiased,
then the p-value we obtain indicates how likely we would be to see a difference
at least as large as that in our sample if the populations of runs of Algorithm A
and Algorithm B were identical. For instance, a p-value equal to or lower than
0.05 indicates that we have satisfied the traditional (and somewhat ritualistic)
95% confidence level test: by rejecting the null hypothesis only when p is at or
below 0.05, we keep the chance of committing a Type I error to at most 5%.
This raises the question of how large a sample we should choose. The sample
size is related to the statistical power of our experiment. If we have too small a
sample then we may obtain high p-values and incorrectly conclude that there is
no significant difference between the two algorithms we are considering. This is
a so-called Type II error; we incorrectly accept the null hypothesis when it is,
in fact, false. In our case it would mean that we would incorrectly believe Algo-
rithm A to be no better than Algorithm B. More precisely, we would conclude,
correctly, that we have no evidence to claim that Algorithm A is significantly
better than Algorithm B at the chosen confidence level. However, had we chosen
a larger sample, we may have had just such evidence. In general, all else being
equal, the larger the sample we choose the less likely we are to commit a Type
II error. This is why researchers prefer larger sample sizes where this is feasible.
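The effect of sample size on the Type II error rate can be seen by simulation. In the sketch below the two algorithms genuinely differ by construction of the synthetic data, and we count how often the test fails to detect that difference at the 95% level; the effect size, sample sizes and number of trials are illustrative assumptions.

```python
# Minimal sketch: estimate the Type II error rate for several sample sizes by
# repeated simulation. Algorithm A really is better on average by construction,
# so every non-rejection of the null hypothesis is a Type II error.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=3)
alpha = 0.05   # 95% confidence level
trials = 500   # number of simulated experiments per sample size

for n in (10, 30, 100):
    misses = 0
    for _ in range(trials):
        runs_a = rng.normal(105.0, 15.0, n)  # A's population mean is truly higher
        runs_b = rng.normal(100.0, 15.0, n)
        _, p = mannwhitneyu(runs_a, runs_b, alternative="greater")
        if p > alpha:  # failed to detect the true difference
            misses += 1
    print(f"n = {n:3d}: estimated Type II error rate = {misses / trials:.2f}")
```

All else being equal, the estimated Type II error rate shrinks as n grows, which is exactly the argument for preferring larger samples where feasible.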