to analyze the profiles to see if there is a clear link from specific genetic variations
to specific aspects of the phenome, for example to try to identify variations that are
linked to occurrences of a particular cancer.
In this example, microarray-based linkage analysis is an application for which
a researcher is developing computational methods. Because the linking process is
uncertain (the data is unreliable and incomplete, genetics is imperfectly understood,
and so on), there is scope for a new method that improves on existing approaches.
Validation of this method will require data, and simulated or artificial data is unlikely
to be persuasive, because the accuracy of simulation would depend on complex
assumptions about factors such as laboratory conditions, biases in the sample of
humans chosen for profiling, microarray error rates, and distributions of individual
genetic variations.
Considering the list of desirable characteristics given above, then, the researcher
might give the following responses.
What data: The researcher could profile some individuals directly, but it is much
cheaper to obtain a collection of profiles from a public genomic database. Such
profiles will often be associated with previous publications, so their characteristics
should be well understood.
What mechanisms: To choose specific data sets, a good approach in this case could
be to find research papers that draw biomedical conclusions based on particular
data of the right kind, and then obtain that data. The data may then need to be nor-
malized or cleaned up in some defensible way: if the data was originally gathered
for other purposes, some of it may not be suitable for the current investigation; or
it may be in an inconsistent format; or it may contain known outliers that could
reasonably be removed by hand; or, if derived frommultiple, inconsistent sources,
it may need to be unified by separate preprocessing of each of the components.
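As an illustration of the kind of preprocessing involved, the sketch below unifies two hypothetical profile tables that use different column names, drops samples flagged as outliers, and normalizes the genotype calls. The file names, column names, and outlier list are invented for the example, not drawn from any particular study.

```python
import pandas as pd

# Hypothetical inputs: two genotype-profile tables from different sources,
# with inconsistent column names and a known-bad sample to exclude.
SOURCES = {
    "study_a.csv": {"sample": "sample_id", "genotype": "genotype"},
    "study_b.csv": {"SampleID": "sample_id", "Call": "genotype"},
}
KNOWN_OUTLIERS = {"NA12878-dup"}  # samples flagged in the original papers

def load_and_unify(sources, outliers):
    frames = []
    for path, rename_map in sources.items():
        df = pd.read_csv(path)
        df = df.rename(columns=rename_map)           # unify column names
        df = df[~df["sample_id"].isin(outliers)]     # drop known outliers
        df["genotype"] = df["genotype"].str.upper()  # normalize formatting
        df["source"] = path                          # keep provenance
        frames.append(df[["sample_id", "genotype", "source"]])
    return pd.concat(frames, ignore_index=True)

if __name__ == "__main__":
    profiles = load_and_unify(SOURCES, KNOWN_OUTLIERS)
    print(profiles.head())
```

Keeping a provenance column, as above, also makes it possible to check later whether results depend unduly on any one source.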
Sufficiency of data: There are several respects in which data volume is relevant. One is
algorithmic: methods tend to behave differently at different scales. Performance
(in terms of processing time) for a data set that fits in CPU cache is unlikely to
be informative for data that requires hard disk or networked access, for example.
Some methods simply don't scale, as unanticipated costs become dominant. A
slightly more subtle challenge of scale is that larger data sets offer different statis-
tical properties. Finding a matching image from amongst a hundred hand-chosen
candidates may be much easier than from amongst a million that were chosen at
random; while the smoother distributions of a large data set may, for example,
simplify the problem of finding similar documents—or, as in this case, detecting
genetic linkages.
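The effect of collection size on these statistical properties can be seen in a small simulation; the dimensionality, the use of random vectors, and the Euclidean distance below are arbitrary choices for illustration, not properties of any real profile data.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
query = rng.normal(size=DIM)

def closest_distance(pool_size, chunk=100_000):
    # Generate random candidates in chunks to keep memory modest, and track
    # the smallest Euclidean distance to the query seen so far.
    best = np.inf
    remaining = pool_size
    while remaining > 0:
        n = min(chunk, remaining)
        candidates = rng.normal(size=(n, DIM))
        best = min(best, np.linalg.norm(candidates - query, axis=1).min())
        remaining -= n
    return best

for pool_size in (100, 10_000, 1_000_000):
    print(f"pool={pool_size:>9}: closest chance match at {closest_distance(pool_size):.3f}")
```

The closest match among a million random candidates is typically much nearer than among a hundred, so behaviour observed on a small collection may not carry over to a large one.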
Another respect in which volume is relevant is statistical significance; data volumes
need to be large enough to ensure that the experiment will be able to detect the
effect that is being hypothesised.
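A rough power calculation makes this concrete. The sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison; the effect sizes are illustrative planning numbers, not estimates from the linkage study itself.

```python
from scipy.stats import norm

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided two-sample test
    to detect a standardized effect of the given size (Cohen's d)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

# A subtle genetic linkage (small effect) needs far more profiles than a strong one.
for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {samples_per_group(d):.0f} profiles per group")
```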
Sufficiency also has another dimension—the number of data sets being used. A
single data set may not be persuasive, particularly if the reader suspects that the
method was tuned to perform well on the data set reported in the paper.
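One way to address that suspicion is to separate tuning from evaluation across collections. The toy harness below assumes hypothetical tune and evaluate functions standing in for the real method; the point is simply that the tuned parameters are frozen before being applied to each independent data set.

```python
# Tune on one development collection, then apply the frozen configuration
# unchanged to several independently sourced data sets.
def report_across_datasets(tune, evaluate, dev_set, held_out_sets):
    params = tune(dev_set)  # tuning sees only the development collection
    return {name: evaluate(data, params) for name, data in held_out_sets.items()}

if __name__ == "__main__":
    # Placeholder tune/evaluate functions stand in for the real method.
    tune = lambda data: {"threshold": sum(data) / len(data)}
    evaluate = lambda data, p: sum(x > p["threshold"] for x in data) / len(data)
    datasets = {"set_a": [1, 2, 3, 4], "set_b": [2, 2, 5, 7, 9]}
    print(report_across_datasets(tune, evaluate, dev_set=[1, 3, 5], held_out_sets=datasets))
```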