A commonly reported descriptive statistic is standard deviation. The benefits of the
standard deviation are that it quantifies variability in a single value, which is in the
same units as the mean, and that it is a key input to statistical inference (as discussed
later). It also has a special meaning for certain distributions, in particular the normal
distribution, where mean and standard deviation fully specify the distribution's shape.
The variance is sometimes reported instead of the standard deviation, the latter being
the square root of the former, but variance is generally harder to interpret, not least
because it is expressed in squared units.
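As a minimal illustration of these quantities, the following Python sketch (using hypothetical timing values) computes all three; note that the standard deviation is in the same units as the mean, whereas the variance is in squared units:

    import statistics

    timings = [1.2, 0.9, 1.4, 1.5, 1.1]  # hypothetical timings, in seconds

    mean = statistics.mean(timings)
    variance = statistics.variance(timings)  # sample variance, in seconds squared
    sd = statistics.stdev(timings)           # sample standard deviation = sqrt(variance)

    print(f"mean = {mean:.3f} s, sd = {sd:.3f} s, variance = {variance:.3f} s^2")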
An alternative is to report quantiles, such as the 25th to 75th percentiles, or (by
analogy with a 95 % confidence interval) the 2.5th to 97.5th percentiles. Quantile
ranges are most naturally combined with the median, itself the 50th percentile, rather
than with the mean. If one is intending to report quantiles, note that an odd number
of experimental runs is preferable to an even number, since the middle run will be
the median; similarly, for a set of 21 experimental runs, the 6th is the 25th percentile,
and the 16th is the 75th percentile. The more extreme the percentile, the greater the
number of runs necessary to attain stable results. In some circumstances, a fuller
description of the distribution of scores may need to be reported, in a form such as
a graph, histogram, or box-and-whisker plot.
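The run-count arithmetic above can be made concrete with a short Python sketch (the 21 scores here are randomly generated, purely for illustration), reading the median and quartiles directly off the sorted runs:

    import random

    random.seed(1)
    runs = sorted(random.gauss(1.2, 0.25) for _ in range(21))  # 21 hypothetical scores

    median = runs[10]  # 11th of 21 sorted values: the 50th percentile
    q25 = runs[5]      # 6th value: the 25th percentile
    q75 = runs[15]     # 16th value: the 75th percentile

    print(f"median = {median:.2f}, 25th-75th percentile range = [{q25:.2f}, {q75:.2f}]")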
Average scores should only be reported with a precision that corresponds to the
accuracy of the average. If only a few instances of a highly variable phenomenon are
observed, then reporting many decimal places gives a false impression of exactness.
For instance, if five runs of an experiment give timings of 1.143, 0.918, 1.398, 1.535,
and 1.049 s, then to say that “the average running time of our algorithm is 1.2086 s”
makes the result seem much more precise than it really is. In this example, the
standard error (the standard deviation divided by the square root of the number of
observations) is 0.113, so the average should be stated as 1.2 s. If you want to provide
greater precision, you need to run more experiments.
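This calculation can be reproduced in a few lines of Python (a minimal sketch; statistics.stdev gives the sample standard deviation, and the timing values are those of the example above):

    import math
    import statistics

    timings = [1.143, 0.918, 1.398, 1.535, 1.049]  # the five runs above, in seconds

    mean = statistics.mean(timings)
    se = statistics.stdev(timings) / math.sqrt(len(timings))  # standard error of the mean

    print(f"mean = {mean:.4f} s, standard error = {se:.3f} s")
    # The standard error is about 0.1 s, so only the first decimal place of the
    # mean is trustworthy:
    print(f"report as: {mean:.1f} s")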
As noted above, variability in inputs or environment is distinct from variability
in tasks. An example of the former is to randomly select training examples for a
classifier, or to randomly generate relational tables for a distributed system; an
example of the latter is the existence of different sources of English-language sentences
for a parser. In both cases, an average score can be calculated across experimental
instances, though with different meanings. In the former case, of variability of inputs
or environment, the concept of an average score is meaningful. We somehow believe
that there is such a thing (on a fixed dataset and hardware) as an average running time
for our program, even though each individual running time varies. It is much less
clear that there is such an entity as an average English sentence, not only because
there are so many ways in which text is created, but also because the universe of
possible sentences is poorly defined. Averaging of scores across tasks can still be
useful, for instance for comparing the performance of two algorithms; but we should
be reluctant to claim that the averaged score represents anything beyond the confines
of the particular experiment.