Domain knowledge: A common failure in computer science research is that work inspired by a real-world problem, such as genetic linkage, abstracts or simplifies the problem in a way that makes it unrealistic, and thus the methods that are developed have no practical relevance. For example, there would be little value in a linkage algorithm that assumed microarray data were free of error.
A similar failure is the reporting of unvalidated results, such as a researcher reporting linkages that have been found but are biologically implausible. The biological roles of the components of the human genome are understood in detail, and this domain knowledge should be part of the interpretation of the results.
Limits: Microarrays are noisy; values may be incorrect or uncertain, and the amount of noise in biological experiments can vary from one laboratory to another. Likewise, the human processes of gathering phenotypic data are also uncertain. There are inherent biases, such as the tendency of microarray data sets to consist of samples from wealthy countries, and from people who are known to have genetically linked disease, with only limited numbers of controls (healthy cases). The incidence of a particular condition, such as a rare cancer, may be very low, and some machine learning techniques perform poorly on highly imbalanced samples.
Knowledge of the extent of such confounds in the data is needed to help assess the significance of the results, and then, as far as possible, to rectify the data. This may involve, for example, careful manual data processing, following explicit guidelines. It is essential that such manipulation does not in some way bias the results towards the method that is being explored—thus avoiding situations that can be parodied as "our method is poor in the presence of outliers and inconsistency, so we removed the problematic data".
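To make the imbalance point concrete, the following is a minimal sketch using hypothetical numbers (not data from any study discussed here): a degenerate classifier that always predicts the majority class appears highly accurate on a rare-condition data set while detecting no cases at all.

```python
# Hypothetical illustration: why raw accuracy misleads on imbalanced data.

def evaluate(labels, predictions):
    """Return (accuracy, recall on the rare positive class)."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    positives = [i for i, y in enumerate(labels) if y == 1]
    true_pos = sum(1 for i in positives if predictions[i] == 1)
    accuracy = correct / len(labels)
    recall = true_pos / len(positives) if positives else 0.0
    return accuracy, recall

# 990 healthy controls (label 0) and 10 cases of a rare condition (label 1).
labels = [0] * 990 + [1] * 10

# A "majority class" classifier: always predict healthy.
predictions = [0] * 1000

accuracy, recall = evaluate(labels, predictions)
print(f"accuracy = {accuracy:.2%}, recall on rare class = {recall:.2%}")
# accuracy = 99.00%, recall on rare class = 0.00%
```

This is why per-class measures such as recall (sensitivity) and precision, rather than overall accuracy, are the appropriate way to report results on skewed samples.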
Results : With a good understanding of the data that is to be used, the researcher
should be able to make some predictions about the results, or about their form.
On an assumption that the method works, the researcher could perhaps estimate
the likelihood that a particular level of statistical significance is observed for a
particular strength of linkage; or could estimate the size of the smallest data set
in which the linkage would be detected.
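An estimate of the smallest detectable data set can be sketched with a standard power calculation. The sketch below uses the usual normal-approximation formula for comparing two proportions; the effect sizes, significance level, and power are illustrative assumptions, not values taken from the text.

```python
# Hypothetical sketch: per-group sample size needed to detect a difference
# in two proportions (e.g. marker frequency in cases vs controls),
# using the standard normal-approximation power formula.

import math

def required_sample_size(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Per-group n at alpha = 0.05 (two-sided) and power = 0.80."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# A strong effect (60% vs 40%) needs far fewer samples than a weak one
# (52% vs 48%), so the strength of the expected linkage directly bounds
# the smallest data set worth collecting.
print(required_sample_size(0.60, 0.40))  # prints 97
print(required_sample_size(0.52, 0.48))  # prints 2449
```

Even such a rough calculation, made before the experiment, gives a prediction against which the eventual results can be checked.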
Some experiments depend on human annotation of data, to provide a gold standard or ground truth. In document classification, for example, human annotation
may indicate the topic of a document: politics, entertainment, sport, and so on. For
microarray data, this annotation is available in a natural way, as the characteristics of
each profiled human are part of the data set. In other cases, gathering of annotations
can be the dominant cost of an experiment—a factor that is often overlooked, or
underestimated, by inexperienced researchers.
In the process of developing new algorithms, researchers typically use as a testbed
a data set with which they are familiar. If the algorithm is parameterized in some
way—by buffer size, say—this testbed can be used for tuning, that is, to identify the
parameter values that give the best performance. What this tuning process almost