collection, perhaps, and normalized discounted cumulative gain for another), and
rests on two false assumptions: first, that the measures are independent of each other,
that is, are assessing distinct qualities; and second, that they are of equal value.
A key concept here is that of predictivity. The main reason that we experiment and
measure is to provide evidence about the behaviour of a system in general—not just
on some specific data set. That is, we use measurements on the data we have to hand
to make predictive claims about what will happen in the future, when the same system
is applied to new data; the conclusions in our papers are usually about properties of
systems, not their behaviour on the data we have already seen.
Some measures are more predictive than others, however. To take a concocted
example, we could measure a system for translating text between languages by com-
paring automatic translations to human translations, and counting how many words
the automatic and human translations have in common. Alternatively we could rate
the automatic translation by how many characters it has—the closer it is in length
to the human translation, the better. The method based on words in common should
be reasonably predictive: if system A is 30% better than system B on 1,000 sample
translations, then we could reasonably expect A to have better commonality-based
scores than B on the next 1,000 translations. But suppose that B was better than A
according to the length-based measure on the samples. In all likelihood, we would
not expect this to predict length-based performance on the new translations; indeed,
commonality probably predicts length better than length predicts length.
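To make the contrast concrete, here is a minimal sketch of the two concocted measures in Python (the function names, the whitespace tokenisation, and the toy sentences are assumptions for illustration, not standard metrics):

```python
def commonality_score(automatic: str, human: str) -> float:
    """Fraction of the human translation's distinct words that also
    appear in the automatic translation (crude word overlap)."""
    auto_words = set(automatic.lower().split())
    human_words = set(human.lower().split())
    if not human_words:
        return 0.0
    return len(auto_words & human_words) / len(human_words)

def length_score(automatic: str, human: str) -> float:
    """Closeness in character length: 1.0 for identical lengths,
    falling towards 0.0 as the lengths diverge."""
    longer = max(len(automatic), len(human), 1)
    return 1.0 - abs(len(automatic) - len(human)) / longer

human = "the cat sat on the mat"
print(commonality_score("a cat sat on a mat", human))  # 0.8
print(length_score("a cat sat on a mat", human))       # ~0.82
```

On this toy pair the word-overlap score rewards shared vocabulary, while the length score would rate any translation of similar length equally well, however wrong its words.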
Where two different measures do assess distinct qualities, however, it is good
practice to report both, particularly where the measures and their underlying qualities
are in tension. To give a simple example, in classification both the false positive
and true positive counts are informative, and methods that reduce the first (good)
also tend to reduce the second (not so good).
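The tension can be seen in a small sketch (the scores, labels, and thresholds below are invented toy data): raising a classifier's decision threshold cuts false positives, but discards true positives along the way.

```python
def confusion_counts(scores, labels, threshold):
    """Count true and false positives when every item scoring at or
    above the threshold is classified as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp, fp

# Invented toy data: classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   0,   1,   1,   0,   1]
for t in (0.5, 0.85):
    tp, fp = confusion_counts(scores, labels, t)
    print(f"threshold={t}: TP={tp}, FP={fp}")
# threshold=0.5:  TP=3, FP=1
# threshold=0.85: TP=1, FP=0  (fewer false positives, but fewer true ones too)
```

Reporting only one of the two counts would hide exactly the quality that the other measures; this is why both should appear.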
Robustness
Experiments should as far as possible be independent of the accuracy of measure-
ments or quality of the implementation. Ideally an experiment should be designed to
yield a result that is unambiguously either true or false; where this is not possible,
another form of confirmation is to demonstrate a trend or pattern of behaviour.
A simple example is the behaviour of query evaluation on a database system with
and without indexes. For a small database, the most efficient solution is exhaustive
search, because use of an index involves access to auxiliary structures and does not
greatly reduce the cost of accessing the data. As database size grows, the cost of
data access grows linearly, while index access costs may be more or less fixed. Thus
the hypothesis “indexes reduce search costs in large databases” can be confirmed by
experiments measuring search costs with and without indexes over a range of database
sizes. The trend—that the advantage given by indexes increases with database size—
is independent of the machine and data. The exact size at which indexes become ben-
eficial will vary, but this value is not being studied; it is the trend that is being studied.
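A rough sketch of such an experiment, assuming a Python list scan stands in for exhaustive search and a dict stands in for an index (the sizes and synthetic data are arbitrary choices):

```python
import random
import time

def timed(fn, *args):
    """Wall-clock seconds for a single call to fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def exhaustive_search(records, key):
    """Scan every record; cost grows linearly with database size."""
    return [r for r in records if r[0] == key]

def indexed_search(index, key):
    """Look up via an auxiliary structure; cost is roughly fixed."""
    return index.get(key, [])

for size in (1_000, 100_000, 1_000_000):
    records = [(random.randrange(size), i) for i in range(size)]
    index = {}
    for rec in records:  # building the index is the auxiliary cost
        index.setdefault(rec[0], []).append(rec)
    key = records[size // 2][0]
    print(size,
          timed(exhaustive_search, records, key),
          timed(indexed_search, index, key))
```

The absolute times, and the size at which the index first wins, will differ from machine to machine; the trend of a widening gap is the reproducible result.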
 