the generally preferred two-tailed test, we would have to calculate the likelihood of
9 or more tosses of any one side, heads or tails. This is a more demanding test, because the probability is twice that of heads alone.
The calculated probability is known as the p-value of our test. If this p-value is below some specified significance level, often denoted α, then we achieve statistical significance, reject the null hypothesis, and accept the alternative: that the coin is biased. (Specifying that α = 0.05 is a common, but not particularly strict, threshold.⁴) If the α threshold is not achieved, we cannot reject the null hypothesis, the result is not significant, and our experiment is inconclusive.
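As a concrete illustration, and purely as a sketch (the use of Python with scipy, and the specific counts, are assumptions rather than part of the text), the p-values for observing 9 heads in 12 tosses of a putatively fair coin can be computed directly:

    # Sketch: exact binomial p-values for the coin example.
    from scipy.stats import binom, binomtest

    n, k = 12, 9   # illustrative counts: 9 heads in 12 tosses

    # One-tailed: P(9 or more heads) under the null hypothesis p = 0.5.
    one_tailed = binom.sf(k - 1, n, 0.5)   # about 0.073

    # Two-tailed: 9 or more of either side; by symmetry under p = 0.5,
    # exactly twice the one-tailed value.
    two_tailed = 2 * one_tailed            # about 0.146

    # scipy's built-in exact test agrees:
    print(binomtest(k, n, 0.5, alternative="greater").pvalue)
    print(binomtest(k, n, 0.5, alternative="two-sided").pvalue)

Note that doubling the one-tailed value is valid only because the null distribution here is symmetric; binomtest computes the two-sided value without relying on that shortcut.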
The coin-flipping example also illustrates another important principle: the larger the sample size, the easier it is to find significance. Observing that three-quarters or more of the tosses are heads has a p-value of 7.3% for 12 tosses, 3.8% for 16 tosses, and 1.1% for 24 tosses. This relates to the statistical concept of power: reliably observing a slight, but real, effect (say, that our coin comes up heads 55% of the time) requires far more trials than does observing a more pronounced effect, such as coming up heads 95% of the time. If we have an estimate of the size of the effect we are seeking, power calculations allow us to estimate the number of trials required to observe it.⁵
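As a sketch of such a calculation (not from the text; the exact one-sided binomial test, α = 0.05, a target power of 80%, and the helper name trials_needed are all illustrative assumptions), the number of tosses needed to detect a biased coin can be estimated as follows:

    # Sketch: smallest n at which a one-sided binomial test
    # (H0: p = 0.5, alpha = 0.05) detects a coin with true
    # heads-probability p_true at the target power.
    import numpy as np
    from scipy.stats import binom

    def trials_needed(p_true, alpha=0.05, target_power=0.8, max_n=5000):
        for n in range(5, max_n + 1):
            ks = np.arange(n + 1)
            # sf(k - 1) is P(X >= k); the critical count is the smallest
            # k whose false-positive rate under H0 is at most alpha.
            crit = ks[binom.sf(ks - 1, n, 0.5) <= alpha][0]
            # Power: the chance of reaching the critical count when the
            # coin really is biased.
            if binom.sf(crit - 1, n, p_true) >= target_power:
                return n
        return None

    print(trials_needed(0.95))   # a pronounced effect: about ten tosses
    print(trials_needed(0.55))   # a slight effect: several hundred tosses

Under these assumptions the pronounced effect needs only a handful of tosses while the slight effect needs hundreds; the precise counts depend on the test and thresholds chosen.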
Hypothesis tests are used, then, to investigate whether improvements are signif-
icant. It is often the case that, in a series of comparisons of two techniques for the
same task, one is better than the other some but not all of the time. In statistical terms,
in such a case the researcher needs to determine whether the two sets of results—two
samples—are drawn from the same population.
We may have experimentally determined, for example, that new is faster than old
“on average”. That is, perhaps new was faster than old on balance when measured
over a variety of inputs, or was faster in the majority of runs on the same input. In
many experiments, execution times can vary substantially from one run to the next,
for all the reasons discussed earlier—the layout of a file on disk, for example, could
be different each time it is constructed, due to operating system variables.
Whatever the cause of the variability, this experiment is based on two sets of times,
one for new and one for old. But suppose that we have a large population of running
times for new alone, and we draw two samples from this population. It is unlikely
that the two samples will have identical averages. Taken at face value, that difference would lead us to conclude that new is faster than itself; by the same reasoning, an observed difference in averages between new and old might not reflect a meaningful difference at all.
This problem is particularly acute when in some cases old is faster (by, say, only a
small margin) and in other cases new is faster (by a large margin).
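A small simulation makes the danger concrete (a sketch, not from the text: the lognormal shape of the timing population, the sample sizes, and the choice of the Mann-Whitney test are illustrative assumptions):

    # Sketch: two samples from the *same* population of running times
    # almost always differ in mean; a two-sample test indicates whether
    # a difference is larger than such chance variation.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(1)
    population = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

    a = rng.choice(population, size=30)   # thirty timed runs of "new"
    b = rng.choice(population, size=30)   # thirty more runs of "new"

    print(a.mean(), b.mean())             # the averages differ...
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    print(p)                              # ...but p is usually well above
                                          # 0.05: no evidence of a real
                                          # difference

Because both samples come from one population, the test reports a small p-value only about 5% of the time; a difference between new and old deserves belief only when it survives such a test.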
The issue can be resolved with a hypothesis test that compares the distributions of the observations. Consider the graphs in Fig. 15.1. Both of the graphs show a pair
⁴ There are many instances when a much smaller α is appropriate. Determining which of (say) 1,000,000 genetic variations is significantly linked to a particular property (such as susceptibility to a certain disease) might require α < 10⁻¹⁰, or smaller. There is an extensive literature on estimation of α in different contexts.
⁵ It is astonishing how many papers report work in which a slight effect is investigated with a small number of trials. Given that such investigations would generally fail to reach significance even if the hypothesis were correct, it seems likely that many interesting research questions are unnecessarily discarded.