certainly cannot do is identify the best parameters for all data sets, or even identify
whether there are stable best parameters to choose.
It is for this reason that descriptions of the research cycle distinguish between
an observation phase (used to learn about the object under study) and a testing or
confirmation phase (used to validate hypotheses). If parameters have been derived by
tuning, the only way to establish their validity is to see if they give good behaviour
on other data. Choosing parameters to suit data, or choosing data to suit parameters,
in all likelihood invalidates the research.
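As a minimal sketch of this separation (in Python, with invented data, and a simple threshold rule standing in for any method with a tunable parameter), the parameter is chosen on one portion of the data and its validity checked on a portion that played no part in the tuning:

```python
import random

def evaluate(threshold, data):
    """Fraction of (value, label) pairs that the threshold rule classifies
    correctly; a stand-in for any method with a tunable parameter."""
    return sum((value > threshold) == label for value, label in data) / len(data)

random.seed(0)
data = []
for _ in range(1000):
    value = random.random()
    data.append((value, value + random.gauss(0, 0.2) > 0.6))  # invented labels

# Split once, up front: one part for tuning, one held out for confirmation.
tuning, held_out = data[:500], data[500:]

# Choose the parameter using the tuning data only.
best = max((t / 100 for t in range(101)), key=lambda t: evaluate(t, tuning))

print(f"tuned threshold: {best:.2f}")
print(f"accuracy on tuning data:   {evaluate(best, tuning):.3f}")
print(f"accuracy on held-out data: {evaluate(best, held_out):.3f}")
```

If the held-out accuracy falls well below the tuning accuracy, the "best" parameter is partly an artefact of the tuning data rather than a property of the method.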
The research in some fields is underpinned by the availability and use of reference
data sets. Such resources can be dramatically larger and more comprehensive than
the materials that could be created by a typical research team, are easy to explain to
readers, and, in principle, allow the direct comparison of work between institutions
and between papers. In some instances, it can be difficult to publish work unless a
reference data set has been used. However, use of such data also carries risks, in
particular of overfitting; that is, methods can become so specialized that they do not
work on other data.
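A toy model (with invented numbers, for illustration only) shows how this can happen even without deliberate tuning: selecting the best of many equally mediocre methods by their score on one fixed benchmark inflates the winner's score, and the inflation does not transfer to fresh data.

```python
import random

def score(method_id, dataset_id):
    """Invented model: every method has the same true quality, 0.70,
    plus data-set-specific noise."""
    rng = random.Random(method_id * 1000 + dataset_id)
    return 0.70 + rng.gauss(0, 0.03)

BENCHMARK, FRESH = 0, 1

# Pick the best of 50 candidate methods by benchmark score alone.
best = max(range(50), key=lambda m: score(m, BENCHMARK))

print(f"winner on the benchmark: {score(best, BENCHMARK):.3f}")  # typically ~0.77
print(f"same method, fresh data: {score(best, FRESH):.3f}")      # back near 0.70
```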
When considering what experiments to try, identify the data or input for which
the hypothesis is least likely to hold. These are the interesting cases: if they are not
tested—if only the cases where the hypothesis is most likely to hold are tested—then
the experiments won't prove much at all. The experiment should of course be a test
of the hypothesis; you need to verify that what you are testing is what you intended
to test, and an experiment should only succeed if the hypothesis is correct.
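For instance, suppose the hypothesis is that a quicksort variant needs only about n log n comparisons. Random inputs, on which the hypothesis is most likely to hold, will support it; the interesting case is already-sorted input. A sketch, using a deliberately naive first-element pivot:

```python
import random
import sys

sys.setrecursionlimit(10_000)  # sorted input drives the recursion n deep

def quicksort(a):
    """Quicksort with the first element as pivot; returns (sorted, comparisons)."""
    if len(a) <= 1:
        return a, 0
    pivot, rest = a[0], a[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    left_sorted, left_cmp = quicksort(left)
    right_sorted, right_cmp = quicksort(right)
    return left_sorted + [pivot] + right_sorted, left_cmp + right_cmp + len(rest)

random.seed(0)
n = 2000
random_input = [random.random() for _ in range(n)]
sorted_input = sorted(random_input)  # the case least likely to support the hypothesis

_, cmp_random = quicksort(random_input)
_, cmp_sorted = quicksort(sorted_input)
print(f"comparisons, random input: {cmp_random:>9,}")  # grows like n log n (~30,000 here)
print(f"comparisons, sorted input: {cmp_sorted:>9,}")  # n(n-1)/2 = 1,999,000: quadratic
```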
An underlying point, then, is that persuasive research requires appropriate data,
and thus you need to be confident that you can obtain good data before committing
to a particular research question. (In some fields, it may be that the research goal
is to obtain data: telescopes and particle accelerators are built to collect data, for
example. But, in computing, such research is extremely rare.) It follows that pursuit
of some questions, no matter how interesting they may be, will not be feasible for some
researchers.
Ask whether a single data set is sufficient, or whether multiple data sets are
required: for separate training and testing, or for independent confirmation. A related
question is whether multiple data sets are indeed sufficiently independent; subsamples
of a single large data set may, for practical purposes, be the same, and not yield
the truly independent confirmation that is being sought.
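A quick way to see the problem (with an invented corpus): two data sets subsampled separately from the same collection share, on average, a large fraction of their items.

```python
import random

random.seed(1)
corpus = range(100_000)

# Two data sets drawn separately, but from the same underlying corpus.
a = set(random.sample(corpus, 50_000))
b = set(random.sample(corpus, 50_000))

print(f"items shared by the two 'independent' samples: {len(a & b) / len(a):.0%}")  # ~50%
```

Disjoint splits avoid shared items, but the parts still share the collection process, and so may not provide independent confirmation either.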
Sometimes appropriate data can be artificial, or simulated; as noted in Chap. 4,
such data can allow a thorough exploration of the properties of an algorithm. But
such data should not be used without a clear understanding of its limitations. For
example, application of a new hash function to random data is unlikely to be a
convincing demonstration that the function is uniform, since the data was uniform
to begin with. Fundamentally, any scheme for generating artificial data relies on a
model, which embodies assumptions and, probably, simplifications. The strongest
defence of artificial data is to validate it against real data.
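The hash-function example can be made concrete with a deliberately weak hash and two kinds of keys (all invented for illustration; a formal test would compare the statistic against the chi-squared distribution with 63 degrees of freedom). Random keys spread across the buckets almost regardless of the hash; structured keys, like the identifiers found in real data, expose the weakness.

```python
import random
import string
from collections import Counter

BUCKETS = 64

def weak_hash(key):
    """A deliberately poor hash: sum of character codes, modulo BUCKETS."""
    return sum(map(ord, key)) % BUCKETS

def chi_squared(keys):
    """Chi-squared statistic for uniformity of weak_hash over the buckets."""
    counts = Counter(weak_hash(k) for k in keys)
    expected = len(keys) / BUCKETS
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(BUCKETS))

random.seed(2)
random_keys = ["".join(random.choices(string.ascii_lowercase, k=16))
               for _ in range(10_000)]
real_like_keys = [f"user{i:06d}" for i in range(10_000)]  # structured identifiers

print(f"chi-squared, random keys:    {chi_squared(random_keys):>10.1f}")  # near 63
print(f"chi-squared, real-like keys: {chi_squared(real_like_keys):>10.1f}")  # vastly larger
```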
A related question is how much data is required. Another way of phrasing the issue
is: to what volumes of data should your claims apply? If you
are making claims about terabytes (say), but testing on megabytes, you are asking