to discern whether one set of experiments is significantly different in some
aspect from another.
Suppose there are two approaches to address a problem in SBSE, each of
which involves the application of some search based algorithm to a set of problem
instances. We collect results for the application of both algorithms, A and B,
and we notice that, over a series of runs of our experiment, Algorithm A tends
to perform better than Algorithm B. The performance we have in mind may
take many forms. It may be that Algorithm A is faster than B, or that after a
certain number of fitness evaluations it has achieved a higher fitness value, or
a higher average fitness. Alternatively, there may be some other measurement
we can make for which we notice a difference in performance that we
believe is worth reporting.
In such cases, SBSE researchers tend to rely on inferential statistics as a means
of addressing the inherently stochastic nature of search based algorithms. That
is, we may notice that the mean fitness achieved by Algorithm A is higher than
that of Algorithm B after 10 executions of each, but how can we be sure that this
is not merely an observation arrived at by chance? It is to answer precisely these
kinds of question that statistical hypothesis testing is used in the experimental
sciences, and SBSE is no exception.
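As a concrete illustration, the sketch below applies one commonly used non-parametric test, the Mann-Whitney U test, to two samples of final fitness values, one from ten runs of each algorithm. The fitness values, the choice of this particular test, and the one-sided alternative are purely illustrative assumptions, not part of any experiment reported here.

```python
# A minimal sketch: do the observed fitness values give us grounds to claim
# that Algorithm A outperforms Algorithm B, or could the observed difference
# plausibly have arisen by chance?
from scipy.stats import mannwhitneyu

# Hypothetical final fitness values from 10 independent runs of each
# algorithm (higher is better). A real experiment would collect these from
# repeated executions with different random seeds.
fitness_a = [0.91, 0.88, 0.93, 0.85, 0.90, 0.92, 0.87, 0.94, 0.89, 0.91]
fitness_b = [0.84, 0.86, 0.82, 0.88, 0.81, 0.85, 0.83, 0.87, 0.80, 0.86]

print("mean A =", sum(fitness_a) / len(fitness_a))
print("mean B =", sum(fitness_b) / len(fitness_b))

# Non-parametric test of the null hypothesis that both samples come from the
# same distribution, against the one-sided alternative that A tends to
# produce higher fitness than B.
statistic, p_value = mannwhitneyu(fitness_a, fitness_b, alternative='greater')
print("Mann-Whitney U =", statistic, " p-value =", p_value)
```

A small p-value indicates that a difference at least this large would be unlikely if the two algorithms really performed the same, which is exactly the kind of evidence the next paragraphs discuss.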
A complete explanation of the issues and techniques that can be used in
applying inferential statistics in SBSE is beyond the scope of this tutorial. However,
there has been a recent paper on the topic of statistical testing of randomised
algorithms by Arcuri and Briand [8], which provides more detail. In this section
we provide an overview of some of the key points of concern.
The typical scenario with which we are concerned is one in which we want to
explore the likelihood that our experiments found that Algorithm A outperforms
Algorithm B purely by chance. Usually we wish to be in a position to make a
claim that we have evidence that suggests that Algorithm A is better than
Algorithm B. For example, as a sanity check, we may wish to show that our
SBSE technique comfortably outperforms a random search. But what do we
mean by 'comfortably outperforms'?
In order to investigate this kind of question we set a threshold on the degree
of chance that we find acceptable. Typically, in the experimental sciences, this
level is chosen to be either 1% or 5%. That is, we will have either a less than 1
in 100 or a less than 5 in 100 chance of believing that Algorithm A outperforms
Algorithm B based on a set of executions when in fact it does not. This is the
chance of making a so-called 'Type I' error. It would lead to us concluding that
some Algorithm A was better than Algorithm B when, in fact, it was not.
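To make the risk of a Type I error concrete, the following sketch is a purely illustrative simulation: both 'algorithms' draw their fitness values from exactly the same hypothetical distribution, so any significant result reported at the 5% level is a false alarm. The distribution, sample sizes and number of repetitions are all assumptions chosen only for illustration.

```python
# A minimal illustrative sketch of the Type I error rate: when Algorithms A
# and B genuinely perform identically, a test at the 5% level should wrongly
# report a significant difference in roughly 5% of experiments.
import random
from scipy.stats import mannwhitneyu

random.seed(0)
ALPHA = 0.05        # chosen significance threshold (5%)
EXPERIMENTS = 2000  # number of simulated experiments
RUNS = 10           # runs of each algorithm per experiment

false_alarms = 0
for _ in range(EXPERIMENTS):
    # Under this artificial null scenario both algorithms sample their
    # fitness values from exactly the same distribution.
    fitness_a = [random.gauss(0.85, 0.05) for _ in range(RUNS)]
    fitness_b = [random.gauss(0.85, 0.05) for _ in range(RUNS)]
    _, p_value = mannwhitneyu(fitness_a, fitness_b, alternative='two-sided')
    if p_value < ALPHA:
        false_alarms += 1  # Type I error: 'A differs from B' purely by chance

print("observed Type I error rate:", false_alarms / EXPERIMENTS)
```

Over many repetitions the observed false-alarm rate settles near the chosen threshold, which is precisely what accepting a 1% or 5% significance level means.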
If we choose a threshold for error of 5% then we have a 95% confidence level in
our conclusion based on our sample of the population of all possible executions
of the algorithm. That is, we are '95% sure that we can claim that Algorithm A
really is better than Algorithm B'. Unpacking this claim a little, what we find
is that there is a population involved. This is the population of all possible runs
of the algorithm in question. For each run we may get different behaviour due
 