The third principle involves understanding how test results vary over time. Programs that
process exactly the same set of data will report a different elapsed time each time they are run.
Background processes on the machine will affect the application, the network will be more
or less congested when the program is run, and so on. Good benchmarks also never process
exactly the same set of data each time they are run; there will be some random behavior built
into the test to mimic the real world. This creates a problem: when comparing the result from
one run to the result from another run, is the difference due to a regression, or due to the
random variation of the test?
This problem can be solved by running the test multiple times and averaging those results.
Then when a change is made to the code being tested, the test can be rerun multiple times,
the results averaged, and then the two averages compared. It sounds so easy.
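As a minimal sketch of that averaging approach, the snippet below compares the mean of several baseline runs with the mean of several specimen runs; the timing values are hypothetical stand-ins for real benchmark measurements:

```python
from statistics import mean

# Hypothetical elapsed times (in seconds) from repeated runs of each version.
baseline_runs = [1.0, 0.8, 1.2]   # original code, run three times
specimen_runs = [0.5, 1.25, 0.5]  # changed code, run three times

baseline_avg = mean(baseline_runs)
specimen_avg = mean(specimen_runs)

# Comparing the two averages alone cannot tell us whether the difference
# is a real regression (or improvement) or just random variation.
print(baseline_avg, specimen_avg)
```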
Unfortunately, it isn't quite as simple as that. Understanding when a difference is a real
regression and when it is a random variation is difficult; it is a key area where science
leads the way, but art will come into play.
When averages in benchmark results are compared, it is impossible to know with certainty
whether the difference in the averages is real or due to random fluctuation. The
best that can be done is to hypothesize that “the averages are the same” and then determine
how likely the observed results would be if that hypothesis were true. If the observed results
would be very unlikely under that hypothesis, then we are comfortable believing that the
difference in the averages is real (though we can never be 100% certain).
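This kind of hypothesis test can be carried out with a two-sample t-test. The sketch below computes the t-statistic and degrees of freedom for Welch's t-test (one common form of the test, which does not assume equal variances) using only the standard library; the timing values are hypothetical:

```python
import math
from statistics import mean, variance

def welch_t(baseline, specimen):
    """Return the t-statistic and degrees of freedom for two samples."""
    m1, m2 = mean(baseline), mean(specimen)
    v1, v2 = variance(baseline), variance(specimen)  # sample variances
    n1, n2 = len(baseline), len(specimen)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical timings in seconds from three runs of each version
t, df = welch_t([1.0, 0.8, 1.2], [0.5, 1.25, 0.5])
# |t| is compared against a critical value of the t-distribution for the
# chosen confidence level; a larger |t| makes the hypothesis that
# "the averages are the same" less plausible.
```

In practice a statistics library would convert the t-statistic and degrees of freedom into the probability described above, rather than computing it by hand.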
Testing code for changes like this is called regression testing. In a regression test, the original
code is known as the baseline and the new code is called the specimen. Take the case of a
batch program where the baseline and specimen are each run three times, yielding the times
given in Table 2-2.