to discern whether one set of experiments is significantly different in some
aspect from another.
Suppose there are two approaches to address a problem in SBSE, each of
which involves the application of some search based algorithm to a set of problem
instances. We collect results for the application of both algorithms, A and B,
and we notice that, over a series of runs of our experiment, Algorithm A tends
to perform better than Algorithm B. The performance we have in mind may
take many forms. It may be that Algorithm A is faster than B, or that after a
certain number of fitness evaluations it has achieved a higher fitness value, or
a higher average fitness. Alternatively, there may be some other measurement
we can make for which we notice a difference in performance that we
believe is worth reporting.
In such cases, SBSE researchers tend to rely on inferential statistics as a means
of addressing the inherently stochastic nature of search based algorithms. That
is, we may notice that the mean fitness achieved by Algorithm A is higher than
that of Algorithm B after 10 executions of each, but how can we be sure that this
is not merely an observation arrived at by chance? It is to answer precisely these
kinds of question that statistical hypothesis testing is used in the experimental
sciences, and SBSE is no exception.
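As a concrete illustration, the sketch below applies one commonly used non-parametric test, the Mann-Whitney U test, to two samples of final fitness values, one from ten runs of each algorithm. The fitness values, the choice of this particular test, and the one-sided alternative are purely illustrative assumptions, not part of any experiment reported here.

```python
# A minimal sketch: do the observed fitness values give us grounds to claim
# that Algorithm A outperforms Algorithm B, or could the observed difference
# plausibly have arisen by chance?
from scipy.stats import mannwhitneyu

# Hypothetical final fitness values from 10 independent runs of each
# algorithm (higher is better). A real experiment would collect these from
# repeated executions with different random seeds.
fitness_a = [0.91, 0.88, 0.93, 0.85, 0.90, 0.92, 0.87, 0.94, 0.89, 0.91]
fitness_b = [0.84, 0.86, 0.82, 0.88, 0.81, 0.85, 0.83, 0.87, 0.80, 0.86]

print("mean A =", sum(fitness_a) / len(fitness_a))
print("mean B =", sum(fitness_b) / len(fitness_b))

# Non-parametric test of the null hypothesis that both samples come from the
# same distribution, against the one-sided alternative that A tends to
# produce higher fitness than B.
statistic, p_value = mannwhitneyu(fitness_a, fitness_b, alternative='greater')
print("Mann-Whitney U =", statistic, " p-value =", p_value)
```

A small p-value indicates that a difference at least this large would be unlikely if the two algorithms really performed the same, which is exactly the kind of evidence the next paragraphs discuss.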
A complete explanation of the issues and techniques that can be used in
applying inferential statistics in SBSE is beyond the scope of this tutorial. However,
there has been a recent paper on the topic of statistical testing of randomised
algorithms by Arcuri and Briand [8], which provides more detail. In this section
we provide an overview of some of the key points of concern.
The typical scenario with which we are concerned is one in which we want to
explore the likelihood that our experiments found that Algorithm A outperforms
Algorithm B purely by chance. Usually we wish to be in a position to make a
claim that we have evidence that suggests that Algorithm A is better than
Algorithm B. For example, as a sanity check, we may wish to show that our
SBSE technique comfortably outperforms a random search. But what do we
mean by 'comfortably outperforms'?
In order to investigate this kind of question we set a threshold on the degree
of chance that we find acceptable. Typically, in the experimental sciences, this
level is chosen to be either 1% or 5%. That is, we will have either a less than 1
in 100 or a less than 5 in 100 chance of believing that Algorithm A outperforms
Algorithm B based on a set of executions when in fact it does not. This is the
chance of making a so-called 'Type I' error. It would lead to us concluding that
some Algorithm A was better than Algorithm B when, in fact, it was not.
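To make the risk of a Type I error concrete, the following sketch is a purely illustrative simulation: both 'algorithms' draw their fitness values from exactly the same hypothetical distribution, so any significant result reported at the 5% level is a false alarm. The distribution, sample sizes and number of repetitions are all assumptions chosen only for illustration.

```python
# A minimal illustrative sketch of the Type I error rate: when Algorithms A
# and B genuinely perform identically, a test at the 5% level should wrongly
# report a significant difference in roughly 5% of experiments.
import random
from scipy.stats import mannwhitneyu

random.seed(0)
ALPHA = 0.05        # chosen significance threshold (5%)
EXPERIMENTS = 2000  # number of simulated experiments
RUNS = 10           # runs of each algorithm per experiment

false_alarms = 0
for _ in range(EXPERIMENTS):
    # Under this artificial null scenario both algorithms sample their
    # fitness values from exactly the same distribution.
    fitness_a = [random.gauss(0.85, 0.05) for _ in range(RUNS)]
    fitness_b = [random.gauss(0.85, 0.05) for _ in range(RUNS)]
    _, p_value = mannwhitneyu(fitness_a, fitness_b, alternative='two-sided')
    if p_value < ALPHA:
        false_alarms += 1  # Type I error: 'A differs from B' purely by chance

print("observed Type I error rate:", false_alarms / EXPERIMENTS)
```

Over many repetitions the observed false-alarm rate settles near the chosen threshold, which is precisely what accepting a 1% or 5% significance level means.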
If we choose a threshold for error of 5% then we have a 95% confidence level in
our conclusion based on our sample of the population of all possible executions
of the algorithm. That is, we are '95% sure that we can claim that Algorithm A
really is better than Algorithm B'. Unpacking this claim a little, what we find
is that there is a population involved. This is the population of all possible runs
of the algorithm in question. For each run we may get different behaviour due
 