run on different sets of data; or whether the aim is to evaluate the behaviour of one
system, or to compare two.
Some researchers are deterred from considering statistics because of the math-
ematical analysis that may be involved. However, first, there are packages that do
much of the hard work. Second, many statistical problems can be couched in terms
of elementary probability and then resolved computationally. For example, consider
the problem of identifying the likelihood that a particular tennis player will win a
match, given that this player has an independent probability of 60 % of winning each
point. The rules are: a game is won when either player has at least four points and
a two-point advantage; a set is won when either player has at least six games and a
two-game advantage; a match is won when either player has a two-set advantage, or
by the winner of the fifth set. Determining the probability of a win is non-trivial for
the statistically innocent.
However, it is easy to write a program that runs a series of matches with a simple
random-number generator for determining who wins each point; over a few thousand
matches the average converges on a reliable value. A more computationally sensible
option is to run a large series of trials to estimate the likelihood of winning a game,
then use this value as input when determining the likelihood of winning a set, and so
on. It is not difficult to check that such code provides reasonable answers and that
the values do indeed converge.6 Many statistics can be estimated in this way; it was
used, for example, to confirm the outcomes of the string-hashing experiments and to
confirm the theoretical predictions for an ideal hash function, where random numbers
and 32-bit pieces of pi were used in place of hash values.
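As an illustration, the following is a minimal sketch of such a simulation in Python. It uses only the simplified scoring rules stated above (in particular, no tie-breaks), the 60 % point-win probability, and a trial count chosen for convenience; the function and variable names are illustrative rather than taken from any published code.

import random

def play_unit(p_win, target, margin):
    """Play one game or set: the first side to reach `target` sub-units
    with a lead of at least `margin` wins. `p_win` is the probability
    of winning each sub-unit, independently."""
    ours, theirs = 0, 0
    while True:
        if random.random() < p_win:
            ours += 1
        else:
            theirs += 1
        if ours >= target and ours - theirs >= margin:
            return True
        if theirs >= target and theirs - ours >= margin:
            return False

def play_match(p_set):
    """Match, per the simplified rules stated above: a two-set advantage
    wins, otherwise the winner of the fifth set wins."""
    ours, theirs = 0, 0
    while ours + theirs < 5:
        if random.random() < p_set:
            ours += 1
        else:
            theirs += 1
        if abs(ours - theirs) >= 2:
            return ours > theirs
    return ours > theirs

def estimate(trial, trials=100_000):
    """Monte Carlo estimate: the fraction of trials that are wins."""
    return sum(trial() for _ in range(trials)) / trials

p_point = 0.60
p_game = estimate(lambda: play_unit(p_point, 4, 2))   # 4 points, 2-point lead
p_set = estimate(lambda: play_unit(p_game, 6, 2))     # 6 games, 2-game lead
p_match = estimate(lambda: play_match(p_set))
print(f"game {p_game:.3f}  set {p_set:.3f}  match {p_match:.3f}")

Staging the estimates in this way, from points to games to sets to the match, is the "computationally sensible" option described above: each level needs only a large number of cheap trials, rather than a full match simulation per sample.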
Randomness and Error
In some circumstances it is possible for an experiment to succeed, or at least appear
to succeed, by luck; there might be an atypical pattern to the data, or variations
in system response might favour one run over another. Where such variations are
possible, many runs should be made, to reduce the probability of accidental success
and (in the statistical sense) to give confidence in the results. This is particularly
true for timings, which can be affected by other users, by system overheads, by the
inability of most operating systems to allocate clock cycles accurately to processes,
and by caching effects.
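A minimal sketch of this kind of repetition is shown below; it times a placeholder operation many times with the Python standard library and reports the spread rather than a single figure. The operation being measured and the run count are illustrative assumptions only.

import statistics
import time

def measure(operation, runs=30):
    """Time `operation` repeatedly and summarise the spread.

    A single timing is unreliable: other processes, system overheads,
    and caching all perturb individual runs, so report the distribution
    rather than one number."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.mean(samples), statistics.stdev(samples)

def operation_under_test():
    # Placeholder for whatever is actually being measured.
    sum(i * i for i in range(100_000))

best, mean, sd = measure(operation_under_test)
print(f"min {best:.6f}s  mean {mean:.6f}s  sd {sd:.6f}s")

The minimum is often taken as the most stable summary for CPU-bound work, while the spread gives some indication of how much interference the individual runs suffered.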
For example, consider the apparently simple experiment of measuring how fast a
block can be accessed from a file stored on disk. Under a typical operating system, the
6 A typical guess of the likelihood of the better player winning the match is 90 or 95 %; in fact, the
likelihood is close to certainty.
When I once suggested to students that they test the code by running it with a probability of
50 % of winning each point, several argued strongly that the program wouldn't terminate—which
is more or less equivalent to arguing that, when tossing coins, you can't get some given number of
heads in a row. They had confused the short-term variability (any number of consecutive throws of
a head will come up eventually) with long-term averages. Such are the pitfalls of intuition.