Voices in favour of experimentalism as a way of researching software development have recently grown stronger. DeMarco [3] claims that "The actual software construction isn't necessarily experimental, but its conception is. And this is where our focus ought to be. It's where our focus always ought to have been". Meyer [4, 5] has also joined the researchers pointing to the importance of experimentation in SE.
A key component of experimentation is replication. To consolidate a body of knowledge built upon experimental results, those results have to be extensively verified. This verification is carried out by replicating an experiment to check whether its results can be reproduced. If the same results are reproduced in different replications, we can infer that they reflect regularities existing in the piece of reality under study. Experimenters acquainted with such regularities can discover the mechanisms regulating the observed results or, at least, predict their behaviour.
Most of the events observed through experiments in SE nowadays are isolated. In other words, the results of most SE experiments have not been reproduced. There is therefore no way to distinguish between the following three situations: the results were produced by chance (the event occurred accidentally); the results are artifactual (the event only occurs in the experiment, not in the reality under study); or the results really do conform to a regularity of the piece of reality being examined.
A replication has some elements in common with its baseline experiment. When we start to examine a phenomenon experimentally, most aspects are unknown, and even the tiniest change in a replication can lead to inexplicable differences in the results. In immature experimental disciplines, which experimental conditions should be controlled can be found out by starting off with replications that closely follow the baseline experiment [6]. In the case of well-known phenomena, the experimental conditions that influence the results can be controlled, and artifactual results can be identified by running less similar replications. For example, different experimental protocols can be used to verify that the results correspond to experiment-independent events.
The immaturity of ESE has been an obstacle to replication. As the mechanisms regulating software development and the key experimental conditions for its investigation are as yet unknown, even the slightest change in a replication leads to inexplicable differences in the results. At the same time, context differences oblige experimenters to adapt the experiment. These changes can lead to sizeable differences in the replication results that prevent the outcomes of the baseline experiment from being corroborated. In several attempts at combining the results of ESE replications, Hayes [7], Miller [8-10], Hannay et al. [11], Jørgensen [12], Pickard et al. [13], Shull et al. [14] and Juristo et al. [15] reported that the differences between results were so large that it was impossible to draw any conclusions from comparing them.
The ESE stereotype of a replication is an experiment that is repeated independently by other researchers at sites different from the baseline experiment's. But some replications in ESE do not conform to this stereotype: they are jointly run, the replicating researchers reuse some of the materials employed in the baseline experiment, or they are run at the same site [16-25]. How replications should be