What does it mean to say that 57% of the time, the series do not converge to the same average? Strictly speaking, it doesn't mean that we have 57% confidence that there is a 25% improvement in the result; it means only that we have 57% confidence that the results are different. There may be a 25% improvement, there may be a 125% improvement; it is even conceivable that the specimen actually has worse performance than the baseline. The most likely case is that the difference between the tests is similar to what has been measured (particularly as the p-value goes down), but certainty can never be achieved.
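To make the p-value concrete, here is a minimal sketch of a two-sample Welch's t-test in pure Python. The sample data and the function name are invented for illustration; for simplicity the p-value is approximated with the normal distribution rather than the exact t distribution, which is reasonable only for moderately large samples.

```python
import math
import statistics

def welch_t_pvalue(baseline, specimen):
    """Welch's two-sample t statistic, with a two-sided p-value
    approximated via the normal distribution (illustrative only)."""
    n1, n2 = len(baseline), len(specimen)
    m1, m2 = statistics.fmean(baseline), statistics.fmean(specimen)
    v1, v2 = statistics.variance(baseline), statistics.variance(specimen)
    t = (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)
    # Probability of seeing a difference at least this large if the
    # two series actually share the same average.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Hypothetical throughput samples for a baseline and a specimen run.
baseline = [102, 98, 101, 99, 103, 97, 100, 101]
specimen = [100, 97, 99, 102, 98, 101, 96, 99]
t, p = welch_t_pvalue(baseline, specimen)
print(f"t = {t:.2f}, p = {p:.2f}")
```

A large p-value here means only that the test is inconclusive, not that the two series are known to be the same. A production harness would use an exact implementation such as `scipy.stats.ttest_ind(equal_var=False)` instead of this normal approximation.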
STATISTICS AND SEMANTICS
The correct way to present results of a t-test is to phrase a statement like this: there is a 57% probability that the specimen differs from the baseline, and the best estimate of that difference is 25%. The common way to present these results is to say that there is a 57% confidence level that there is a 25% improvement in the results. Though that isn't exactly the same thing and will drive statisticians crazy, it is easy shorthand to adopt and isn't that far off the mark. Probabilistic analysis always involves some uncertainty, and that uncertainty is better understood when the semantics are precisely stated. But particularly in an arena where the underlying issues are well understood, some semantic shortcuts will inevitably creep in.
The t-test is typically used in conjunction with an α-value, which is a (somewhat arbitrary) point at which the result is assumed to have statistical significance. The α-value is commonly set to 0.1, which means that a result is considered statistically significant if there is only a 10% (0.1) probability that the specimen and baseline are the same (or conversely, that 90% of the time there is a difference between the specimen and baseline). Other commonly used α-values are 0.05 (95%) and 0.01 (99%). A test is considered statistically significant if the p-value is smaller than the α-value.
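The significance decision itself is a one-line comparison. The following sketch uses an invented helper name, with the p-value of 0.43 taken from the example in the text:

```python
def is_significant(p_value, alpha=0.1):
    """A result is statistically significant when the p-value
    falls below the chosen alpha-value."""
    return p_value < alpha

# The example from the text: p = 0.43 fails at the 90% level (alpha = 0.1).
print(is_significant(0.43, alpha=0.1))   # not significant: inconclusive
# A much smaller p-value would clear the same threshold.
print(is_significant(0.03, alpha=0.1))
```

Note that lowering α (say, to 0.01) makes the test stricter: more results will be declared inconclusive, but fewer spurious regressions will be flagged.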
Hence, the proper way to search for regressions in code is to determine a level of statistical significance (say, 90%) and then to use the t-test to determine if the specimen and baseline are different within that degree of statistical significance. Care must be taken to understand what it means if the test for statistical significance fails. In the example, the p-value is 0.43; we cannot say with 90% confidence that the averages are different. The fact that the test is not statistically significant does not mean that it is an insignificant result; it simply means that the test is incon-