the reader to believe that your results can be extrapolated a million-fold. Yet, for
some problems, merely doubling the volume of data can introduce new challenges.
Such issues arise in the testing of techniques such as document search methods. They
also arise in challenging ways in the context of algorithms for analysis of DNA,
that is, the strings representing genomes. These algorithms have to contend with
several different forms of scale. One form is the length of the strings, which for
a single organism can vary from a few thousand characters (viruses) or millions
(bacteria) to billions (a vertebrate such as a human) or over a hundred billion (some
plants). Another form is the complexity of the genome; some contain a great deal of
internal redundancy or copying. Yet another form is the number of organisms being
simultaneously analyzed. A further, more subtle form of scale is the evolutionary
distance between the individual organisms—here, the dimension is of timescale or
diversity, rather than raw data volume. Each of these forms of scale has significant,
non-linear impact on the behaviour of commonly used bioinformatics algorithms.
The question of data volume arises in the formal statistical sense of power: whether
your data is of sufficient quantity, or quality, to allow observation of the effect that you
are seeking. For example, if you are comparing two parsers according to their ability
to accurately extract phrases from English text, it may be that statistical principles
will tell you that a collection of 100 examples is unlikely to be sufficient for the
anticipated improvement to be detected.
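The sample-size point can be made concrete with a standard power calculation. The sketch below is illustrative only: the accuracies of 0.80 and 0.85 are assumed figures, not from the text, and the normal approximation to a two-proportion z-test stands in for whatever test an experimenter would actually use.

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_power(p1, p2, n):
    """Approximate power of a two-sided two-proportion z-test at
    significance 0.05, with n examples per parser (normal approximation).

    p1, p2: the true per-example accuracies of the two parsers.
    """
    z_alpha = 1.96                              # critical value, alpha = 0.05
    pbar = (p1 + p2) / 2.0
    se0 = sqrt(2.0 * pbar * (1.0 - pbar) / n)   # standard error under H0
    se1 = sqrt(p1 * (1.0 - p1) / n +
               p2 * (1.0 - p2) / n)             # standard error under H1
    z = (abs(p1 - p2) - z_alpha * se0) / se1
    return norm_cdf(z)

# Hypothetical scenario: parser accuracies of 0.80 versus 0.85.
# With 100 examples the power is only about 15%, so the anticipated
# improvement would very likely go undetected; with 1,000 examples
# the power exceeds 80%.
print(two_proportion_power(0.80, 0.85, 100))
print(two_proportion_power(0.80, 0.85, 1000))
```

Calculations of this kind, done before the experiment, tell you whether the data collection is large enough for the effect you hope to see.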
Interpretation
When checking experimental design or outcomes, consider whether there are other
possible interpretations of the results; and, if so, design further tests to eliminate these
possibilities. Consider for instance the problem of finding whether a file stored on
disk contains a given string. One algorithm directly scans the file; another algorithm,
which has been found to give faster response, scans a compressed form of the file.
Further tests would be needed to identify whether the speed gain was because the
second algorithm used fewer machine cycles or because the compressed file was
fetched more quickly from disk.
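One practical way to separate those two explanations is to measure CPU time and wall-clock time side by side: a speed gain that appears in CPU time suggests fewer machine cycles, while a gain confined to wall-clock time points at faster disk fetches. The helper below is a minimal sketch of that idea; the scan functions it would time are hypothetical, not from the text, and a real test would also control for the operating system's file cache by comparing cold and warm runs.

```python
import time

def timed(fn, *args):
    """Return (wall_seconds, cpu_seconds) for one call to fn(*args).

    perf_counter measures elapsed wall-clock time (including any time
    blocked on disk I/O); process_time measures CPU time consumed by
    this process only. Comparing the two helps attribute a speed gain
    to fewer machine cycles versus faster I/O.
    """
    w0, c0 = time.perf_counter(), time.process_time()
    fn(*args)
    w1, c1 = time.perf_counter(), time.process_time()
    return w1 - w0, c1 - c0

# Hypothetical usage -- scan_plain and scan_compressed are placeholders:
#   wall_a, cpu_a = timed(scan_plain, "corpus.txt", "needle")
#   wall_b, cpu_b = timed(scan_compressed, "corpus.z", "needle")
# If cpu_b < cpu_a, the compressed-scan algorithm genuinely does less
# work; if only wall_b < wall_a, the gain is likely in disk transfer.
```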
Care is particularly needed when checking the outcome of negative or failed
experiments. A reader of the statement “we have shown that it is not possible to
make further improvement” may wonder whether what has actually been shown is
that the author is not competent to make further improvement. Moreover, the failure
of an experiment typically leads to it being redesigned—such failure is as likely to
expose problems in the tests as in the hypothesis itself. Design of experiments to
demonstrate the failure of the hypothesis is particularly challenging.
It is always worth considering whether the results obtained are sensible. For
example, are there rules of conservation that should apply to the experiment? This issue
was illustrated by one of my students, who was evaluating a classification method
in which documents were automatically allocated to one of several predetermined
categories. She reported numbers of true and false positives and negatives, as a
 