for whom other medications have failed. Designing an experiment includes deciding
what the possible, or reasonable, inputs are, and how they might be varied.
A straightforward case of varying inputs is randomization.
Consider a classifier that achieves a certain effectiveness score on a collection of
100,000 items, with 10,000 of the items randomly selected for training and 90,000
for the test; obviously (I hope it is obvious), the effectiveness will differ for different
random partitionings of the items into training and test sets. The correct approach
is to perform multiple randomizations, to yield a distribution of outcomes that can
be analyzed statistically. In other cases, explicit randomization is not available. For
example, we may be constrained to certain real-world data sets. Any activity involving
human judgement often gives different results for different humans, or even for
the same human at different times.
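As a concrete sketch of the multiple-randomization approach (not taken from the study itself; evaluate is a hypothetical stand-in for training and testing the classifier), the following C program repeatedly shuffles the item indices, takes the first 10,000 as the training set and the remaining 90,000 as the test set, and records one effectiveness score per trial:

    #include <stdio.h>
    #include <stdlib.h>

    #define N_ITEMS  100000
    #define N_TRAIN  10000
    #define N_TRIALS 30

    /* Hypothetical stand-in: train on the first n_train indices, test
       on the remainder, and return an effectiveness score in [0, 1]. */
    static double evaluate(const int *idx, int n_train, int n_total)
    {
        (void)idx; (void)n_train; (void)n_total;
        return 0.0;  /* replace with the real classifier */
    }

    /* Fisher-Yates shuffle; modulo bias is ignored for brevity. */
    static void shuffle(int *a, int n)
    {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    int main(void)
    {
        static int idx[N_ITEMS];
        double scores[N_TRIALS];

        for (int i = 0; i < N_ITEMS; i++)
            idx[i] = i;

        srand(1);  /* fixed seed so the partitionings are reproducible */
        for (int t = 0; t < N_TRIALS; t++) {
            shuffle(idx, N_ITEMS);   /* a fresh random partitioning */
            scores[t] = evaluate(idx, N_TRAIN, N_ITEMS);
        }

        /* scores[] now holds a distribution of outcomes for analysis. */
        for (int t = 0; t < N_TRIALS; t++)
            printf("%.4f\n", scores[t]);
        return 0;
    }

The collection of scores, rather than any single one, is the object of analysis; varying the fixed seed yields further replications.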
Another form of variability in input is where the task itself is qualitatively changed.
For instance, a search engine will achieve the same effectiveness on a given query
and collection of Web pages each time it is run, but will achieve different scores
for different queries or different collections—which may be due to the new data
changing the nature of the task. In cases where the tasks change qualitatively, it may
not be meaningful to talk about a population, sample, or average, so other strategies
for interpretation of results may be appropriate.
Aggregation and Variability
A simple way to aggregate the multiple scores achieved by running an experiment
multiple times is to report an average of the outcomes. Typically, the average is
reported as the (arithmetic) mean, though in some circumstances—where the results
are highly skewed, for instance, or there appear to be outliers—the median, or some
sort of trimmed mean, may be preferable. The average score is more stable and
representative of the experiment than any single score: it is a better predictor of what
score would be achieved if the experiment were run again, and it provides a firmer
basis for comparing results between different experimental setups.
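As an illustrative sketch (the scores here are invented, and the trimmed mean simply discards the lowest and highest values), the following C program computes all three aggregates for a small set of trial scores:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    static double mean(const double *v, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += v[i];
        return s / n;
    }

    static double median(double *v, int n)  /* sorts v in place */
    {
        qsort(v, n, sizeof *v, cmp_double);
        return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
    }

    /* Trimmed mean: drop the `trim` lowest and highest values. */
    static double trimmed_mean(double *v, int n, int trim)
    {
        qsort(v, n, sizeof *v, cmp_double);
        return mean(v + trim, n - 2 * trim);
    }

    int main(void)
    {
        /* Invented trial scores, including one apparent outlier. */
        double scores[] = { 0.71, 0.74, 0.73, 0.70, 0.98, 0.72, 0.69 };
        int n = sizeof scores / sizeof *scores;

        printf("mean    %.3f\n", mean(scores, n));
        printf("median  %.3f\n", median(scores, n));
        printf("trimmed %.3f\n", trimmed_mean(scores, n, 1));
        return 0;
    }

On skewed data such as this, the median and trimmed mean are much less influenced by the single high score than the mean is.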
Consider how to identify the likely worst case of a particular class of string-
hashing functions—that is, to find out how many strings might conceivably hash to
the same slot in practice. We faced this problem in work on string hashing, where
analysis of the functions would be suspect. (The distribution of characters in words
is not uniform, and the distribution varies according to character position within
words. Thus an analysis would involve assumptions that could easily be confounded
in practice.) Theory tells us that the worst case of all strings hashing to the same slot
is ridiculously improbable for an ideal function; the specific question is to identify
how close to ideal a randomly chosen member of a class of hash functions is.
In evaluating the properties of string-hashing functions, there are several variables
in the population of inputs: the hash-table size, the number of input strings, the
strings themselves, and a hash function chosen from the class. In the particular
class we were considering, hash functions are determined by a seed value, so the
class is finite for an efficient implementation in a language such as C.
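A minimal sketch of such an experiment appears below. The multiplicative hash and the word list are illustrative stand-ins only, as the text does not specify the class of functions; each seed selects one member of the class, and for each seed the program hashes the strings into a table and reports the occupancy of the fullest slot:

    #include <stdio.h>

    #define TABLE_SIZE 1024  /* hash-table size              */
    #define N_SEEDS    8     /* members of the class sampled */

    /* Illustrative seeded multiplicative hash; the seed picks one
       function from the class. */
    static unsigned hash(const char *s, unsigned seed)
    {
        unsigned h = seed;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % TABLE_SIZE;
    }

    int main(void)
    {
        /* Stand-in for the real set of input strings. */
        const char *strings[] = { "the", "quick", "brown", "fox",
                                  "jumps", "over", "lazy", "dog" };
        int n = sizeof strings / sizeof *strings;

        for (unsigned seed = 1; seed <= N_SEEDS; seed++) {
            unsigned count[TABLE_SIZE] = { 0 };
            unsigned worst = 0;

            for (int i = 0; i < n; i++)
                count[hash(strings[i], seed)]++;
            for (int j = 0; j < TABLE_SIZE; j++)
                if (count[j] > worst)
                    worst = count[j];

            printf("seed %u: fullest slot holds %u strings\n",
                   seed, worst);
        }
        return 0;
    }

Varying TABLE_SIZE, the number of strings, and the strings themselves in the same way explores the other dimensions of the input population.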
 