collection, perhaps, and normalized discounted cumulative gain for another), and
rests on two false assumptions: first, that the measures are independent of each other,
that is, are assessing distinct qualities; and second, that they are of equal value.
A key concept here is that of predictivity. The main reason that we experiment and
measure is to provide evidence about the behaviour of a system in general—not just
on some specific data set. That is, we use measurements on the data we have to hand
to make predictive claims about what will happen in the future, when the same system
is applied to new data; the conclusions in our papers are usually about properties of
systems, not their behaviour on the data we have already seen.
Some measures are more predictive than others, however. To take a concocted
example, we could measure a system for translating text between languages by com-
paring automatic translations to human translations, and counting how many words
the automatic and human translations have in common. Alternatively we could rate
the automatic translation by how many characters it has—the closer it is in length
to the human translation, the better. The method based on words in common should
be reasonably predictive: if system A is 30% better than system B on 1,000 sample
translations, then we could reasonably expect A to have better commonality-based
scores than B on the next 1,000 translations. But suppose that B was better than A
according to the length-based measure on the samples. In all likelihood, we would
not expect this to predict length-based performance on the new translations; indeed,
commonality probably predicts length better than length predicts length.
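To make the contrast concrete, here is a minimal sketch of the two concocted measures in Python (the function names, the whitespace tokenisation, and the toy sentences are assumptions for illustration, not standard metrics):

```python
def commonality_score(automatic: str, human: str) -> float:
    """Fraction of the human translation's distinct words that also
    appear in the automatic translation (crude word overlap)."""
    auto_words = set(automatic.lower().split())
    human_words = set(human.lower().split())
    if not human_words:
        return 0.0
    return len(auto_words & human_words) / len(human_words)

def length_score(automatic: str, human: str) -> float:
    """Closeness in character length: 1.0 for identical lengths,
    falling towards 0.0 as the lengths diverge."""
    longer = max(len(automatic), len(human), 1)
    return 1.0 - abs(len(automatic) - len(human)) / longer

human = "the cat sat on the mat"
print(commonality_score("a cat sat on a mat", human))  # 0.8
print(length_score("a cat sat on a mat", human))       # ~0.82
```

On this toy pair the word-overlap score rewards shared vocabulary, while the length score would rate any translation of similar length equally well, however wrong its words.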
Where two different measures do assess distinct qualities, however, it is good
practice to report both, particularly where the measures and their underlying qualities
are in tension. To give a simple example, in classification both the false positive
and true positive counts are informative, and methods that reduce the first (good)
also tend to reduce the second (not so good).
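The tension can be seen in a small sketch (the scores, labels, and thresholds below are invented toy data): raising a classifier's decision threshold cuts false positives, but discards true positives along the way.

```python
def confusion_counts(scores, labels, threshold):
    """Count true and false positives when every item scoring at or
    above the threshold is classified as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp, fp

# Invented toy data: classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   0,   1,   1,   0,   1]
for t in (0.5, 0.85):
    tp, fp = confusion_counts(scores, labels, t)
    print(f"threshold={t}: TP={tp}, FP={fp}")
# threshold=0.5:  TP=3, FP=1
# threshold=0.85: TP=1, FP=0  (fewer false positives, but fewer true ones too)
```

Reporting only one of the two counts would hide exactly the quality that the other measures; this is why both should appear.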
Robustness
Experiments should as far as possible be independent of the accuracy of measure-
ments or quality of the implementation. Ideally an experiment should be designed to
yield a result that is unambiguously either true or false; where this is not possible,
another form of confirmation is to demonstrate a trend or pattern of behaviour.
A simple example is the behaviour of query evaluation on a database system with
and without indexes. For a small database, the most efficient solution is exhaustive
search, because use of an index involves access to auxiliary structures and does not
greatly reduce the cost of accessing the data. As database size grows, the cost of
data access grows linearly, while index access costs may be more or less fixed. Thus
the hypothesis “indexes reduce search costs in large databases” can be confirmed by
experiments measuring search costs with and without indexes over a range of database
sizes. The trend—that the advantage given by indexes increases with database size—
is independent of the machine and data. The exact size at which indexes become ben-
eficial will vary, but this value is not being studied; it is the trend that is being studied.
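A rough sketch of such an experiment, assuming a Python list scan stands in for exhaustive search and a dict stands in for an index (the sizes and synthetic data are arbitrary choices):

```python
import random
import time

def timed(fn, *args):
    """Wall-clock seconds for a single call to fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def exhaustive_search(records, key):
    """Scan every record; cost grows linearly with database size."""
    return [r for r in records if r[0] == key]

def indexed_search(index, key):
    """Look up via an auxiliary structure; cost is roughly fixed."""
    return index.get(key, [])

for size in (1_000, 100_000, 1_000_000):
    records = [(random.randrange(size), i) for i in range(size)]
    index = {}
    for rec in records:  # building the index is the auxiliary cost
        index.setdefault(rec[0], []).append(rec)
    key = records[size // 2][0]
    print(size,
          timed(exhaustive_search, records, key),
          timed(indexed_search, index, key))
```

The absolute times, and the size at which the index first wins, will differ from machine to machine; the trend of a widening gap is the reproducible result.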
 