to analyze the profiles to see if there is a clear link from specific genetic variations
to specific aspects of the phenome, for example to try to identify variations that are
linked to occurrences of a particular cancer.
In this example, microarray-based linkage analysis is an application for which
a researcher is developing computational methods. Because the linking process is
uncertain (the data is unreliable and incomplete, genetics is imperfectly understood,
and so on), there is scope for a new method that improves on existing approaches.
Validation of this method will require data, and simulated or artificial data is unlikely
to be persuasive, because the accuracy of simulation would depend on complex
assumptions about factors such as laboratory conditions, biases in the sample of
humans chosen for profiling, microarray error rates, and distributions of individual
genetic variations.
Considering the list of desirable characteristics given above, then, the researcher
might give the following responses.
What data: The researcher could profile some individuals directly, but it is much
cheaper to obtain a collection of profiles from a public genomic database. Such
profiles will often be associated with previous publications, so their characteristics
should be well understood.
What mechanisms: To choose specific data sets, a good approach in this case could
be to find research papers that draw biomedical conclusions based on particular
data of the right kind, and then obtain that data. The data may then need to be nor-
malized or cleaned up in some defensible way: if the data was originally gathered
for other purposes, some of it may not be suitable for the current investigation; or
it may be in an inconsistent format; or it may contain known outliers that could
reasonably be removed by hand; or, if derived frommultiple, inconsistent sources,
it may need to be unified by separate preprocessing of each of the components.
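As an illustration of the kind of preprocessing involved, the sketch below unifies two hypothetical profile tables that use different column names, drops samples flagged as outliers, and normalizes the genotype calls. The file names, column names, and outlier list are invented for the example, not drawn from any particular study.

```python
import pandas as pd

# Hypothetical inputs: two genotype-profile tables from different sources,
# with inconsistent column names and a known-bad sample to exclude.
SOURCES = {
    "study_a.csv": {"sample": "sample_id", "genotype": "genotype"},
    "study_b.csv": {"SampleID": "sample_id", "Call": "genotype"},
}
KNOWN_OUTLIERS = {"NA12878-dup"}  # samples flagged in the original papers

def load_and_unify(sources, outliers):
    frames = []
    for path, rename_map in sources.items():
        df = pd.read_csv(path)
        df = df.rename(columns=rename_map)           # unify column names
        df = df[~df["sample_id"].isin(outliers)]     # drop known outliers
        df["genotype"] = df["genotype"].str.upper()  # normalize formatting
        df["source"] = path                          # keep provenance
        frames.append(df[["sample_id", "genotype", "source"]])
    return pd.concat(frames, ignore_index=True)

if __name__ == "__main__":
    profiles = load_and_unify(SOURCES, KNOWN_OUTLIERS)
    print(profiles.head())
```

Keeping a provenance column, as above, also makes it possible to check later whether results depend unduly on any one source.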
Sufficiency of data: There are several respects in which data volume is relevant. One is
algorithmic: methods tend to behave differently at different scales. Performance
(in terms of processing time) for a data set that fits in CPU cache is unlikely to
be informative for data that requires hard disk or networked access, for example.
Some methods simply don't scale, as unanticipated costs become dominant. A
slightly more subtle challenge of scale is that larger data sets offer different statis-
tical properties. Finding a matching image from amongst a hundred hand-chosen
candidates may be much easier than from amongst a million that were chosen at
random; while the smoother distributions of a large data set may, for example,
simplify the problem of finding similar documents—or, as in this case, detecting
genetic linkages.
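The effect of collection size on these statistical properties can be seen in a small simulation; the dimensionality, the use of random vectors, and the Euclidean distance below are arbitrary choices for illustration, not properties of any real profile data.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
query = rng.normal(size=DIM)

def closest_distance(pool_size, chunk=100_000):
    # Generate random candidates in chunks to keep memory modest, and track
    # the smallest Euclidean distance to the query seen so far.
    best = np.inf
    remaining = pool_size
    while remaining > 0:
        n = min(chunk, remaining)
        candidates = rng.normal(size=(n, DIM))
        best = min(best, np.linalg.norm(candidates - query, axis=1).min())
        remaining -= n
    return best

for pool_size in (100, 10_000, 1_000_000):
    print(f"pool={pool_size:>9}: closest chance match at {closest_distance(pool_size):.3f}")
```

The closest match among a million random candidates is typically much nearer than among a hundred, so behaviour observed on a small collection may not carry over to a large one.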
Another respect in which volume is relevant is statistical significance; data volumes
need to be large enough to ensure that the experiment will be able to detect the
effect that is being hypothesised.
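A rough power calculation makes this concrete. The sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison; the effect sizes are illustrative planning numbers, not estimates from the linkage study itself.

```python
from scipy.stats import norm

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided two-sample test
    to detect a standardized effect of the given size (Cohen's d)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

# A subtle genetic linkage (small effect) needs far more profiles than a strong one.
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {samples_per_group(d):.0f} profiles per group")
```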
Sufficiency also has another dimension—the number of data sets being used. A
single data set may not be persuasive, particularly if the reader suspects that the
method was tuned to perform well on the data set reported in the paper.
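One way to address that suspicion is to separate tuning from evaluation across collections. The toy harness below assumes hypothetical tune and evaluate functions standing in for the real method; the point is simply that the tuned parameters are frozen before being applied to each independent data set.

```python
# Tune on one development collection, then apply the frozen configuration
# unchanged to several independently sourced data sets.
def report_across_datasets(tune, evaluate, dev_set, held_out_sets):
    params = tune(dev_set)  # tuning sees only the development collection
    return {name: evaluate(data, params) for name, data in held_out_sets.items()}

if __name__ == "__main__":
    # Placeholder tune/evaluate functions stand in for the real method.
    tune = lambda data: {"threshold": sum(data) / len(data)}
    evaluate = lambda data, p: sum(x > p["threshold"] for x in data) / len(data)
    datasets = {"set_a": [1, 2, 3, 4], "set_b": [2, 2, 5, 7, 9]}
    print(report_across_datasets(tune, evaluate, dev_set=[1, 3, 5], held_out_sets=datasets))
```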