Table 2-2. Hypothetical times to execute two tests

                    Baseline       Specimen
First iteration     1.0 seconds    0.5 seconds
Second iteration    0.8 seconds    1.25 seconds
Third iteration     1.2 seconds    0.5 seconds
Average             1.0 seconds    0.75 seconds

The average of the specimen says there is a 25% improvement in the code. How confident
can we be that the test really reflects a 25% improvement? Things look good: two of the
three specimen values are less than the baseline average, and the size of the improvement is
large. Yet when the analysis described in this section is performed on those results, it turns
out that the probability the specimen and the baseline have the same performance is 43%.
When numbers like these are observed, 43% of the time the underlying performance of the
two tests is the same. Hence, performance is different only 57% of the time. This, by the
way, is not exactly the same thing as saying that 57% of the time the performance is 25%
better, but more about that a little later.

These probabilities seem different than might be expected because of the large variation
in the results. In general, the larger the variation in a set of results, the harder it is to
guess the probability that the difference in the averages is real rather than due to random
chance.
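To make the variation concrete, the averages and the spread of the two series in Table 2-2 can be computed directly. The following is a minimal sketch, not code from the original text: the class name IterationStats and its helper methods are hypothetical, and the sample standard deviation (dividing by n - 1) is assumed as the measure of variation.

```java
public class IterationStats {
    // Arithmetic mean of a series of timings.
    public static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); an assumed measure of spread.
    public static double stdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    // Relative improvement of the specimen over the baseline (0.25 means 25% faster).
    public static double improvement(double[] baseline, double[] specimen) {
        return (mean(baseline) - mean(specimen)) / mean(baseline);
    }

    public static void main(String[] args) {
        // Timings in seconds, taken from Table 2-2.
        double[] baseline = {1.0, 0.8, 1.2};
        double[] specimen = {0.5, 1.25, 0.5};

        System.out.printf("baseline: mean=%.3f stddev=%.3f%n",
                mean(baseline), stdDev(baseline));
        System.out.printf("specimen: mean=%.3f stddev=%.3f%n",
                mean(specimen), stdDev(specimen));
        System.out.printf("improvement: %.1f%%%n",
                100.0 * improvement(baseline, specimen));
    }
}
```

With these inputs the specimen's standard deviation is larger than the 0.25-second gap between the two averages, which is why the apparent 25% improvement is hard to trust.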

This number (43%) is based on the result of Student's t-test, which is a statistical analysis
based on the series and their variances. Student, by the way, is the pen name of the scientist
who first published the test; it isn't named that way to remind you of graduate school where
you (or at least I) slept through statistics class. The t-test produces a number called the
p-value, which, loosely speaking, refers to the probability that the null hypothesis for the
test is true. (There are several programs and class libraries that can calculate t-test results;
the numbers produced in this section come from using the TTest class of the Apache Commons
Mathematics Library.)

The null hypothesis in regression testing is the hypothesis that the two tests have equal
performance. The p-value for this example is roughly 43%, which means the confidence we can
have that the series converge to the same average is 43%. Conversely, the confidence we
have that the series do not converge to the same average is 57%.
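The calculation behind a p-value like this can be sketched without any library. The code below is an illustration only and does not call the TTest class mentioned above: the class WelchTTestSketch and all of its methods are hypothetical names, and it assumes the two-sided, unequal-variance (Welch) form of the two-sample t-test. The tail probability is obtained by numerically integrating the t distribution after the substitution x = sqrt(df) * tan(theta), which avoids needing a gamma function; this assumes at least two samples per series, so that the degrees of freedom are at least 1.

```java
public class WelchTTestSketch {
    public static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample variance (divides by n - 1).
    public static double variance(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return squares / (xs.length - 1);
    }

    // Welch's t statistic: difference of means over its estimated standard error.
    public static double tStatistic(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    // Welch-Satterthwaite approximation of the degrees of freedom.
    public static double degreesOfFreedom(double[] a, double[] b) {
        double va = variance(a) / a.length;
        double vb = variance(b) / b.length;
        return (va + vb) * (va + vb)
                / (va * va / (a.length - 1) + vb * vb / (b.length - 1));
    }

    // cos(theta)^(df - 1), clamped at zero to guard against rounding near pi/2.
    private static double integrand(double theta, double df) {
        return Math.pow(Math.max(0.0, Math.cos(theta)), df - 1.0);
    }

    // Composite Simpson's rule for the integrand over [0, upper].
    private static double integrate(double upper, double df) {
        int n = 200_000; // must be even
        double h = upper / n;
        double sum = integrand(0.0, df) + integrand(upper, df);
        for (int i = 1; i < n; i++) {
            sum += (i % 2 == 1 ? 4.0 : 2.0) * integrand(i * h, df);
        }
        return sum * h / 3.0;
    }

    // Two-sided p-value: probability of a |t| at least this large
    // if the two series really had the same underlying mean.
    public static double pValue(double[] a, double[] b) {
        double t = Math.abs(tStatistic(a, b));
        double df = degreesOfFreedom(a, b);
        double theta = Math.atan(t / Math.sqrt(df));
        double within = integrate(theta, df) / integrate(Math.PI / 2.0, df);
        return 1.0 - within;
    }

    public static void main(String[] args) {
        // Timings in seconds, taken from Table 2-2.
        double[] baseline = {1.0, 0.8, 1.2};
        double[] specimen = {0.5, 1.25, 0.5};
        System.out.printf("t=%.4f df=%.4f p=%.4f%n",
                tStatistic(baseline, specimen),
                degreesOfFreedom(baseline, specimen),
                pValue(baseline, specimen));
    }
}
```

For the Table 2-2 data this sketch should produce a p-value close to the roughly 43% quoted in the text. In practice a maintained statistics library is the better choice; the hand-rolled integration is here only to show which quantities (means, variances, and sample counts) feed the result.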