the generally preferred two-tailed test, we would have to calculate the likelihood of
9 or more tosses of any one side, heads or tails. This is a more demanding test, because the probability is twice that of heads alone.
The calculated probability is known as the p-value of our test. If this p-value is below some specified significance level, often denoted α, then we achieve statistical significance, reject the null hypothesis, and accept the alternative: that the coin is biased. (Specifying that α = 0.05 is a common, but not particularly strict, threshold.⁴) If the α threshold is not achieved, we cannot reject the null hypothesis, the result is not significant, and our experiment is inconclusive.
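As a concrete illustration, and purely as a sketch (the use of Python with scipy, and the specific counts, are assumptions rather than part of the text), the p-values for observing 9 heads in 12 tosses of a putatively fair coin can be computed directly:

    # Sketch: exact binomial p-values for the coin example.
    from scipy.stats import binom, binomtest

    n, k = 12, 9   # illustrative counts: 9 heads in 12 tosses

    # One-tailed: P(9 or more heads) under the null hypothesis p = 0.5.
    one_tailed = binom.sf(k - 1, n, 0.5)   # about 0.073

    # Two-tailed: 9 or more of either side; by symmetry under p = 0.5,
    # exactly twice the one-tailed value.
    two_tailed = 2 * one_tailed            # about 0.146

    # scipy's built-in exact test agrees:
    print(binomtest(k, n, 0.5, alternative="greater").pvalue)
    print(binomtest(k, n, 0.5, alternative="two-sided").pvalue)

Note that doubling the one-tailed value is valid only because the null distribution here is symmetric; binomtest computes the two-sided value without relying on that shortcut.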
The coin-flipping example also illustrates another important principle: the larger the sample size, the easier it is to find significance. Observing that three-quarters or more of the tosses are heads has a p-value of 7.3% for 12 tosses, 3.8% for 16 tosses, and 1.1% for 24 tosses. This relates to the statistical concept of power: reliably observing a slight, but real, effect (say, that our coin comes up heads 55% of the time) requires far more trials than does observing a more pronounced effect, such as coming up heads 95% of the time. If we have an estimate of the size of the effect we are seeking, power calculations allow us to estimate the number of trials required to observe it.⁵
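As a sketch of such a calculation (not from the text; the exact one-sided binomial test, α = 0.05, a target power of 80%, and the helper name trials_needed are all illustrative assumptions), the number of tosses needed to detect a biased coin can be estimated as follows:

    # Sketch: smallest n at which a one-sided binomial test
    # (H0: p = 0.5, alpha = 0.05) detects a coin with true
    # heads-probability p_true at the target power.
    import numpy as np
    from scipy.stats import binom

    def trials_needed(p_true, alpha=0.05, target_power=0.8, max_n=5000):
        for n in range(5, max_n + 1):
            ks = np.arange(n + 1)
            # sf(k - 1) is P(X >= k); the critical count is the smallest
            # k whose false-positive rate under H0 is at most alpha.
            crit = ks[binom.sf(ks - 1, n, 0.5) <= alpha][0]
            # Power: the chance of reaching the critical count when the
            # coin really is biased.
            if binom.sf(crit - 1, n, p_true) >= target_power:
                return n
        return None

    print(trials_needed(0.95))   # a pronounced effect: about ten tosses
    print(trials_needed(0.55))   # a slight effect: several hundred tosses

Under these assumptions the pronounced effect needs only a handful of tosses while the slight effect needs hundreds; the precise counts depend on the test and thresholds chosen.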
Hypothesis tests are used, then, to investigate whether improvements are signif-
icant. It is often the case that, in a series of comparisons of two techniques for the
same task, one is better than the other some but not all of the time. In statistical terms,
in such a case the researcher needs to determine whether the two sets of results—two
samples—are drawn from the same population.
We may have experimentally determined, for example, that new is faster than old
“on average”. That is, perhaps new was faster than old on balance when measured
over a variety of inputs, or was faster in the majority of runs on the same input. In
many experiments, execution times can vary substantially from one run to the next,
for all the reasons discussed earlier—the layout of a file on disk, for example, could
be different each time it is constructed, due to operating system variables.
Whatever the cause of the variability, this experiment is based on two sets of times,
one for new and one for old. But suppose that we have a large population of running
times for new alone, and we draw two samples from this population. It is unlikely
that the two samples will have identical averages. Taken at face value, that difference would lead us to conclude that new is faster than itself; by the same reasoning, an observed difference in averages between new and old might not reflect a meaningful difference at all.
This problem is particularly acute when in some cases old is faster (by, say, only a
small margin) and in other cases new is faster (by a large margin).
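A small simulation makes the danger concrete (a sketch, not from the text: the lognormal shape of the timing population, the sample sizes, and the choice of the Mann-Whitney test are illustrative assumptions):

    # Sketch: two samples from the *same* population of running times
    # almost always differ in mean; a two-sample test indicates whether
    # a difference is larger than such chance variation.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(1)
    population = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

    a = rng.choice(population, size=30)   # thirty timed runs of "new"
    b = rng.choice(population, size=30)   # thirty more runs of "new"

    print(a.mean(), b.mean())             # the averages differ...
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    print(p)                              # ...but p is usually well above
                                          # 0.05: no evidence of a real
                                          # difference

Because both samples come from one population, the test reports a small p-value only about 5% of the time; a difference between new and old deserves belief only when it survives such a test.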
The issue can be resolved with a hypothesis test that compares the distributions of the observations. Consider the graphs in Fig. 15.1. Both of the graphs show a pair
⁴ There are many instances when a much smaller α is appropriate. Determining which of (say) 1,000,000 genetic variations is significantly linked to a particular property (such as susceptibility to a certain disease) might require α < 10⁻¹⁰, or smaller. There is an extensive literature on estimation of α in different contexts.
⁵ It is astonishing how many papers report work in which a slight effect is investigated with a small number of trials. Given that such investigations would generally fail to reach significance even if the hypothesis were correct, it seems likely that many interesting research questions are unnecessarily discarded.