for whom other medications have failed. Designing an experiment includes deciding
what the possible, or reasonable, inputs are, and how they might be varied.
A straightforward case of varying inputs is randomization.
Consider a classifier that achieves a certain effectiveness score on a collection of
100,000 items, with 10,000 of the items randomly selected for training and 90,000
for the test; obviously (I hope it is obvious), the effectiveness will differ for different
random partitionings of the items into training and test sets. The correct approach
is to perform multiple randomizations, to yield a distribution of outcomes that can
be analyzed statistically. In other cases, explicit randomization is not available. For
example, we may be constrained to certain real-world data sets. Any activity involving
human judgement often gives different results for different humans, or even for
the same human at different times.
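As a concrete sketch of the multiple-randomization approach (not taken from the study itself; evaluate is a hypothetical stand-in for training and testing the classifier), the following C program repeatedly shuffles the item indices, takes the first 10,000 as the training set and the remaining 90,000 as the test set, and records one effectiveness score per trial:

    #include <stdio.h>
    #include <stdlib.h>

    #define N_ITEMS  100000
    #define N_TRAIN  10000
    #define N_TRIALS 30

    /* Hypothetical stand-in: train on the first n_train indices, test
       on the remainder, and return an effectiveness score in [0, 1]. */
    static double evaluate(const int *idx, int n_train, int n_total)
    {
        (void)idx; (void)n_train; (void)n_total;
        return 0.0;  /* replace with the real classifier */
    }

    /* Fisher-Yates shuffle; modulo bias is ignored for brevity. */
    static void shuffle(int *a, int n)
    {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    int main(void)
    {
        static int idx[N_ITEMS];
        double scores[N_TRIALS];

        for (int i = 0; i < N_ITEMS; i++)
            idx[i] = i;

        srand(1);  /* fixed seed so the partitionings are reproducible */
        for (int t = 0; t < N_TRIALS; t++) {
            shuffle(idx, N_ITEMS);   /* a fresh random partitioning */
            scores[t] = evaluate(idx, N_TRAIN, N_ITEMS);
        }

        /* scores[] now holds a distribution of outcomes for analysis. */
        for (int t = 0; t < N_TRIALS; t++)
            printf("%.4f\n", scores[t]);
        return 0;
    }

The collection of scores, rather than any single one, is the object of analysis; varying the fixed seed yields further replications.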
Another form of variability in input is where the task itself is qualitatively changed.
For instance, a search engine will achieve the same effectiveness on a given query
and collection of Web pages each time it is run, but will achieve different scores
for different queries or different collections—which may be due to the new data
changing the nature of the task. In cases where the tasks change qualitatively, it may
not be meaningful to talk about a population, sample, or average, so other strategies
for interpretation of results may be appropriate.
Aggregation and Variability
A simple way to aggregate the multiple scores achieved by running an experiment
multiple times is to report an average of the outcomes. Typically, the average is
reported as the (arithmetic) mean, though in some circumstances—where the results
are highly skewed, for instance, or there appear to be outliers—the median, or some
sort of trimmed mean, may be preferable. The average score is more stable and
representative of the experiment than any single score: it is a better predictor of what
score would be achieved if the experiment were run again, and it provides a firmer
basis for comparing results between different experimental setups.
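As an illustrative sketch (the scores here are invented, and the trimmed mean simply discards the lowest and highest values), the following C program computes all three aggregates for a small set of trial scores:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    static double mean(const double *v, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += v[i];
        return s / n;
    }

    static double median(double *v, int n)  /* sorts v in place */
    {
        qsort(v, n, sizeof *v, cmp_double);
        return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
    }

    /* Trimmed mean: drop the `trim` lowest and highest values. */
    static double trimmed_mean(double *v, int n, int trim)
    {
        qsort(v, n, sizeof *v, cmp_double);
        return mean(v + trim, n - 2 * trim);
    }

    int main(void)
    {
        /* Invented trial scores, including one apparent outlier. */
        double scores[] = { 0.71, 0.74, 0.73, 0.70, 0.98, 0.72, 0.69 };
        int n = sizeof scores / sizeof *scores;

        printf("mean    %.3f\n", mean(scores, n));
        printf("median  %.3f\n", median(scores, n));
        printf("trimmed %.3f\n", trimmed_mean(scores, n, 1));
        return 0;
    }

On skewed data such as this, the median and trimmed mean are much less influenced by the single high score than the mean is.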
Consider how to identify the likely worst case of a particular class of string-
hashing functions—that is, to find out how many strings might conceivably hash to
the same slot in practice. We faced this problem in work on string hashing, where
analysis of the functions would be suspect. (The distribution of characters in words
is not uniform, and the distribution varies according to character position within
words. Thus an analysis would involve assumptions that could easily be confounded
in practice.) Theory tells us that the worst case of all strings hashing to the same slot
is ridiculously improbable for an ideal function; the specific question is to identify
how close to ideal a randomly chosen member of a class of hash functions is.
In evaluating the properties of string-hashing functions, there are several variables
in the population of inputs: the hash-table size, the number of input strings, the
strings themselves, and a hash function chosen from the class. In the particular
class we were considering, hash functions are determined by a seed value, so the
class is finite for an efficient implementation in a language such as C.
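A minimal sketch of such an experiment appears below. The multiplicative hash and the word list are illustrative stand-ins only, as the text does not specify the class of functions; each seed selects one member of the class, and for each seed the program hashes the strings into a table and reports the occupancy of the fullest slot:

    #include <stdio.h>

    #define TABLE_SIZE 1024  /* hash-table size              */
    #define N_SEEDS    8     /* members of the class sampled */

    /* Illustrative seeded multiplicative hash; the seed picks one
       function from the class. */
    static unsigned hash(const char *s, unsigned seed)
    {
        unsigned h = seed;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % TABLE_SIZE;
    }

    int main(void)
    {
        /* Stand-in for the real set of input strings. */
        const char *strings[] = { "the", "quick", "brown", "fox",
                                  "jumps", "over", "lazy", "dog" };
        int n = sizeof strings / sizeof *strings;

        for (unsigned seed = 1; seed <= N_SEEDS; seed++) {
            unsigned count[TABLE_SIZE] = { 0 };
            unsigned worst = 0;

            for (int i = 0; i < n; i++)
                count[hash(strings[i], seed)]++;
            for (int j = 0; j < TABLE_SIZE; j++)
                if (count[j] > worst)
                    worst = count[j];

            printf("seed %u: fullest slot holds %u strings\n",
                   seed, worst);
        }
        return 0;
    }

Varying TABLE_SIZE, the number of strings, and the strings themselves in the same way explores the other dimensions of the input population.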
 