2^32 or 2^64 members. A constraint was that there were theoretical predictions only
for some table sizes and load factors. It would be possible to explore other table
sizes and input sizes, to seek a wide picture of behaviour, but intuitively it seemed
unlikely that different observations would be made. The hash functions were chosen
by randomly generating 32-bit seeds. The strings were chosen randomly from about
a dozen sources: text, programs, DNA fragments, binary files, and exhaustive sets of
strings of some given fixed length.
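As an illustration only, the following sketch shows the shape of such an experiment; the toy multiplicative hash, the table size of 2^17, and the synthetic keys are assumptions made for the sketch, not the setup used in the study. Strings are hashed with a randomly seeded function and the resulting slot occupancies are compared with the ideal (Poisson) prediction.

import random
from collections import Counter
from math import exp, factorial

def seeded_hash(s: str, seed: int, table_size: int) -> int:
    # Simple multiplicative string hash parameterised by a 32-bit seed.
    h = seed & 0xFFFFFFFF
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h % table_size

def occupancy_profile(strings, seed, table_size):
    # How many slots hold 0, 1, 2, ... strings after hashing the whole set.
    slots = Counter(seeded_hash(s, seed, table_size) for s in strings)
    profile = Counter(slots.values())
    profile[0] = table_size - len(slots)   # slots left empty
    return profile

def poisson_expectation(n_strings, table_size, k):
    # Expected number of slots holding exactly k strings for an ideal hash.
    lam = n_strings / table_size
    return table_size * exp(-lam) * lam ** k / factorial(k)

if __name__ == "__main__":
    strings = [f"key{i}" for i in range(100_000)]   # stand-in for a real source
    table_size = 2 ** 17
    seed = random.getrandbits(32)                   # randomly generated seed
    profile = occupancy_profile(strings, seed, table_size)
    for k in range(5):
        observed = profile.get(k, 0)
        expected = poisson_expectation(len(strings), table_size, k)
        print(k, observed, round(expected, 1))

Repeating the run over many seeds and many string sources, and comparing observed against expected occupancies, is the comparison with "ideal" behaviour referred to above.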
These strings are clearly not “typical”, even assuming we know what “typical”
means in such cases; there are many possible sources of strings, and who can say
which is most likely to be hashed. Had the behaviour of the functions varied signifi-
cantly between the different sources of string, it would have been difficult to draw any
meaningful conclusion; however, the behaviour for every set was virtually identical,
and moreover was indistinguishable from ideal. (Note, too, that this is an example
of a multi-variable experiment. The behaviour under each variable can be evaluated
by holding the others constant.)
As another example of the pitfalls of sampling and populations, consider natural-
language processing, with say the goal of evaluating the accuracy with which a
parser can identify nouns. The result depends on the input: perhaps tweets, or text
derived by optical-character recognition from printed material, or randomly chosen
Web pages, or articles from newspapers. Evaluation of typical accuracy depends
on assumptions about the population, and on the sampling process. A truly random
sample, if sufficiently large, should have in miniature all of the characteristics of the
population it represents.
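To make the dependence on population assumptions concrete, the following sketch uses invented per-source accuracies and two hypothetical population mixtures; the numbers have no empirical basis and serve only to show that the "typical" accuracy reported is as much a property of the assumed population as of the parser.

def population_accuracy(per_source_accuracy, mixture):
    # Weighted average of per-source accuracy under an assumed population mix.
    return sum(per_source_accuracy[s] * w for s, w in mixture.items())

# Hypothetical accuracies of a noun detector on different input sources.
per_source_accuracy = {"tweets": 0.81, "ocr": 0.74, "web": 0.88, "news": 0.93}

mostly_news = {"tweets": 0.05, "ocr": 0.05, "web": 0.20, "news": 0.70}
mostly_tweets = {"tweets": 0.70, "ocr": 0.10, "web": 0.15, "news": 0.05}

print(population_accuracy(per_source_accuracy, mostly_news))    # about 0.90
print(population_accuracy(per_source_accuracy, mostly_tweets))  # about 0.82

The same parser, with the same per-source behaviour, yields different "overall" accuracies under different assumptions about what it will be asked to process.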
Given that experiments should be based on explicit assumptions about underlying
populations and samples, some interesting consequences follow. Consider a thought
experiment: the simple task of measuring the average height of the students at a
particular university. Choose a sample of students at random, measure their height,
and take the average. It might be found that all the students in the sample, excepting
one or two outliers, are between 150 and 200 cm, with an average of 172. Now choose
one student at random. It should be obvious that the likelihood that this student's
height is average, say 172 ± 1 cm, is low.
We thus conclude that a randomly chosen individual is likely to be atypical! By
the same reasoning, conclusions based on a single input, outcome, or event may well
be meaningless.
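A small simulation makes the point concrete; the normal distribution and its parameters below are assumptions chosen to resemble the thought experiment, not measured data.

import random

random.seed(1)
heights = [random.gauss(172, 8) for _ in range(100_000)]  # assumed population
mean = sum(heights) / len(heights)

within_1cm = sum(1 for h in heights if abs(h - mean) <= 1) / len(heights)
print(f"mean height is about {mean:.1f} cm")
print(f"fraction within 1 cm of the mean is about {within_1cm:.2f}")  # roughly 0.10

Even though nearly all simulated heights fall in a 50 cm band, only about one individual in ten lies within a centimetre of the average.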
The importance of randomness in the selection or generation of inputs to tests
is something that is often missed by novice researchers. As counter-intuitive as it
may appear, it is better to pick examples randomly than to try systematically to select
“representative” examples, or—worse still—to let some other arbitrary factor (such as
date of creation, order in a file, or alphabetical sorting of identifiers) make the selection.
The extent to which a collection of random samples is likely to be typical of the
population from which they are drawn can be estimated statistically from the samples
themselves; in contrast, the possible effects of a deterministic process are unknown.
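For instance, the standard error of the mean, computed from the sample alone, indicates how far the sample average is likely to lie from the population average. The sketch below uses simulated heights purely for illustration; the population, sample size, and 95% interval are assumptions of the sketch.

import random
import statistics

random.seed(2)
population = [random.gauss(172, 8) for _ in range(50_000)]  # assumed population
sample = random.sample(population, 200)                     # random sample

mean = statistics.mean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5         # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"sample mean {mean:.1f} cm, 95% interval roughly {low:.1f} to {high:.1f} cm")
# No comparable interval can be computed for a deterministically chosen
# "representative" subset, because its sampling distribution is unknown.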
Whether an average is a reasonable estimate of typical behaviour depends on
the kind of event being measured. It makes no sense to report average running time
when input size varies, for example. For evaluating the accuracy of noun detection of