Making Knowledge - Life Out of Sequence

could find the hidden properties of molecules by doing some sort of
biochemical assay. 29

Most bioinformatic problems have this character: there are too many
data points, too many individual cases, to check one by one; only the
computer has the time and the power to process them. But how do you
decide when something is “real”? How do you decide whether the computer
is producing a biological result or an artifact? How do you trust the
machine?

When I posed this question to bioinformaticians, or when I observed
their work in the lab, it seemed that they were using the computer itself
to reassure themselves that it was producing “biology” rather than
garbage. This process was often described as a “sanity check”: tests were
specially designed to check that the computer was not taking the data
and doing something that made no sense. Bioinformaticians consider
these tests to be the computational equivalent of experimental controls.
In a traditional wet lab, a control is usually a well-characterized
standard: if you are measuring the length of a sequence on a gel, the
control might be a sequence of known length. If the experiment does
not reproduce the length of the control accurately, then it is unlikely
that it is measuring the lengths of unknown samples accurately either. A
computational sanity check could be described in the same terms: run
the computer program on a well-characterized data set for which the
result is known to check that it is giving sensible results. For example,
sometimes it is possible to use the computer to construct an alternative
data set (often using random elements) that, when run through a given
piece of software, should produce a predictable outcome. Just as with
my project, a large part of the work of doing bioinformatics revolves
around the construction of appropriate controls or sanity checks. This is
not a trivial task: the choice of how to construct such tests determines
the faith one can put in the final results. In other words, the quality and
thoroughness of the sanity checks are what ultimately give the work its
plausibility (or lack thereof).
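
To make the idea concrete, a sanity check of this kind might be sketched in Python roughly as follows. This is only an illustration and is not taken from any of the labs described here: the motif, the function names, and the toy motif counter standing in for "a given piece of software" are all hypothetical. The positive control is a constructed sequence in which the answer is known in advance, and the negative control is a set of randomly shuffled copies of the same sequence, in which the motif should almost never appear.

```python
import random


def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    width = len(motif)
    return sum(
        1
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == motif
    )


def positive_control(motif, copies, spacer="A" * 10):
    """Build a sequence that contains a known number of motif copies.

    The count is exactly `copies` as long as the motif's first base
    appears only once in the motif and never in the spacer.
    """
    return spacer + "".join(motif + spacer for _ in range(copies))


def shuffled_control(sequence, rng):
    """Return a shuffled copy: same base composition, order destroyed."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def empirical_p_value(sequence, motif, rng, n_shuffles=1000):
    """Share of shuffled controls with at least as many hits as observed."""
    observed = count_motif(sequence, motif)
    as_extreme = sum(
        count_motif(shuffled_control(sequence, rng), motif) >= observed
        for _ in range(n_shuffles)
    )
    # Add-one correction so the estimate is never exactly zero.
    return (as_extreme + 1) / (n_shuffles + 1)


rng = random.Random(0)  # fixed seed so the check is repeatable
motif = "GATTACA"       # hypothetical motif, chosen only for illustration

# Positive control: the tool must recover the answer that was planted.
known = positive_control(motif, copies=5)
assert count_motif(known, motif) == 5

# Negative control: shuffled sequences should rarely match the real count.
p_value = empirical_p_value(known, motif, rng)
print(f"planted hits: {count_motif(known, motif)}, empirical p = {p_value:.4f}")
```

If the positive control fails, the software is not measuring what it claims to measure; if the shuffled sequences produce as many hits as the real one, the "result" is probably an artifact of base composition rather than biology. How the shuffling is done (here, a simple permutation of single bases) is itself a modeling choice, which is one reason the construction of such controls is not trivial.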

Such sanity checks are actually simulations. They are (stochastic)
models of biological systems, implemented on a computer.
Bioinformaticians are aware that there is much about both their data and
their tools that they cannot know. As for the data, there are simply too
many to be able to examine them all by eye. As for their tools, they
necessarily remain in ignorance about the details of a piece of software,
either because it was written by someone else or because it is
sufficiently complicated that it may not behave as designed or expected.
In both cases, it is
