algorithm that an intelligent, informed person might use if asked to solve the problem.
That is, one potential point of comparison is the first workable option that a reasonable
person might suggest.
It is critical that baselines be identified early in the research program. For example,
what is the point of developing new methods if existing methods (or, perhaps worse,
trivial methods) provide a satisfactory solution? In work with Web data, for example,
we found that problems we were experiencing with parsing of the text might in
principle be resolved by automatically determining which (European) language each
page was written in. To our surprise, a trivial method based on counting occurrences
of a small number of representative words (such as “the” for English or “der” for
German) gave 100% accuracy on our test data. Plans to investigate richer techniques
had to be abandoned.
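
A trivial baseline of this kind can be sketched in a few lines. The following is a minimal illustration, not the method actually used in that work: only "the" and "der" come from the text, and the other marker words, the French entry, and the function name guess_language are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical marker words. Only "the" and "der" are named in the text;
# the remaining words and the French entry are illustrative assumptions.
MARKERS = {
    "english": {"the", "and", "of"},
    "german": {"der", "und", "das"},
    "french": {"le", "et", "les"},
}


def guess_language(text):
    """Return the language whose marker words occur most often, or None."""
    # Lowercase the text and split it into runs of letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # Score each language by the total occurrences of its marker words.
    scores = {
        lang: sum(counts[w] for w in markers)
        for lang, markers in MARKERS.items()
    }
    best = max(scores, key=scores.get)
    # If no marker word occurs at all, decline to guess.
    return best if scores[best] > 0 else None
```

For instance, guess_language("Der Hund und das Haus") returns "german". The value of such a sketch is as a point of comparison: a richer technique is only worth developing if it beats this on the same test data.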
Persuasive Data
For work that involves experiments, it is critical that you have access to appropriate
data, and that you understand it well.[1] In general terms, you need to consider:
• What data may be available, and whether it is created by you or sourced from
  elsewhere.
• What specific mechanisms will be used to gather and standardize the data.
• Whether the data will be sufficient in volume or quality to give a robust answer to
  the question.
• What domain knowledge may be required to properly interpret the data.
• What the limits, biases, flaws, and properties of the data are likely to be, and how
  these problems will be addressed or managed.
• What the results will be like if the data supports the hypothesis; or, alternatively,
  what they will be like if the hypothesis is false.
To understand these requirements, put yourself in the position of the reader. You
want to be persuaded that an algorithm is the very best option available, for a certain
class of problem. A test on inadequate data, or on data that has any of a range of
uncertainties, will leave you doubting the claims.
Consider a detailed example. A microarray is a technology that can be used to
cheaply obtain the genetic profile of a human, by identifying, for each of, say, a
million common genetic variations, whether the variation is present or absent. With
a large collection of individuals that have a microarray profile, and other phenome
data on the individuals, such as health status or physical characteristics, it is possible
[1] In this discussion I generally use data in the usual sense in computing, namely as the raw material
on which experiments operate. In other contexts, data is the result or output of an experiment, such
as measurements gathered in a lab or from human subjects. Confusingly, computing experiments
on data produce data as output. It is the output sense of the word data that is meant in the truism
“we process data to obtain information, analyze information to obtain knowledge, and comprehend
knowledge to obtain wisdom”.
 