As the number of transistors in a microprocessor and the number of lines of code in operating
systems climb into the millions, managing the resulting complexity becomes a central challenge.
Complexity theory explains how extremely small errors in the initial conditions of a complex
system, such as a single mistake in a million-line piece of code, can grow to influence behavior
on a much larger scale. It's no surprise, then, that PCs occasionally fail or crash because of
"memory leaks" and other nonspecific symptoms of system complexity.
Sometimes, the results are more insidious, such as the math errors caused by a defect in Intel's
original Pentium chip.
Fortunately, technologies have been developed to catch potential problems before
they surface. For example, decision tables—matrices of possible input and output states—can help
identify combinations of input conditions that should be tried when testing a microprocessor. When
the number of possible input conditions rises to the hundreds, decision tables and other state-
validation tools make an otherwise impossible task doable.
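To make the idea concrete, here is a minimal sketch of a decision table in Python. The condition names and the toy output column are hypothetical illustrations, not drawn from any particular processor's test plan:

```python
from itertools import product

# Hypothetical input conditions for a device under test; each condition
# can independently be True or False.
CONDITIONS = {
    "reset_asserted":    (True, False),
    "interrupt_pending": (True, False),
    "cache_enabled":     (True, False),
}

def expected_action(rule):
    """Toy output column: the behavior a tester should observe for a rule."""
    if rule["reset_asserted"]:
        return "reinitialize"
    if rule["interrupt_pending"]:
        return "service interrupt"
    return "execute next instruction"

# Each row of the decision table pairs one combination of input
# conditions with its expected output state.
for values in product(*CONDITIONS.values()):
    rule = dict(zip(CONDITIONS, values))
    print(rule, "->", expected_action(rule))
```

With only three binary conditions the table has eight rows; with hundreds of conditions the full cross product explodes, which is exactly where tools that prune and prioritize rules become indispensable.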
Archiving
As illustrated by the gene sequencing machine, the end result of processing the DNA fragments is
volumes of data that must be stored for a variety of uses. For example, the sequence data can be
compared with other investigators' data to check for inconsistencies or to validate findings. The data can be
processed locally in order to visualize the most likely protein structures that would result from
translation of the nucleotide sequences. In addition, the data can be submitted to one of the national
databases to support the work of other microbiologists or to give the researcher academic credit for
the electronic publication. One reason for creating biological databases, then, is to support the
analysis and communication of data, information, and metadata relevant to molecular biologists. In
many respects, the functions of archiving, processing, and communications overlap significantly.
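As a small example of the local processing mentioned above, the sketch below translates a nucleotide fragment into its polypeptide product using Biopython (assumptions: the Biopython package is installed, and the fragment itself is hypothetical):

```python
from Bio.Seq import Seq  # assumes Biopython is installed

# A hypothetical 30-nucleotide fragment from a sequencing run.
fragment = Seq("ATGGTGCATCTGACTCCTGAGGAGAAGTCT")

# Translate the reading frame into its polypeptide product using the
# standard genetic code.
protein = fragment.translate()
print(protein)  # MVHLTPEEKS
```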
Just as the transfer of data from DNA to RNA to protein relies on an information infrastructure, data
archives rely on an information technology (IT) infrastructure. This IT infrastructure includes network
and database technologies as well as standard vocabularies to store and access information. Even
though sequencing and other molecular biology data is vast and growing daily, there are huge gaps
in our understanding of how these databases relate to each other and to higher-level disease
databases. One motivation for constructing archives and linking them together is to close these
gaps as quickly as possible.
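As one concrete way an investigator might tap such linked archives, the sketch below retrieves a GenBank record from NCBI through Biopython's Entrez interface. It assumes Biopython is installed, network access is available, and a valid contact e-mail is supplied; the accession shown (NM_000518, human beta-globin mRNA) is just an illustration:

```python
from Bio import Entrez  # assumes Biopython and network access

Entrez.email = "researcher@example.org"  # NCBI requests a contact address

# Fetch one archived record (human beta-globin mRNA) in GenBank format.
handle = Entrez.efetch(db="nucleotide", id="NM_000518",
                       rettype="gb", retmode="text")
record_text = handle.read()
handle.close()

print(record_text.splitlines()[0])  # the LOCUS line of the record
```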
For the molecular biologist involved in developing or using databases, it's important to consider the
processes involved in managing data before focusing on the technology. That is, the process of data
collection, use, and dissemination should drive technology. After all, Mendel's notebooks didn't
dictate his experiments with garden-variety peas, but they empowered him by leveraging his
capacity to recall previous experiments, to plan future ones, and to publish his findings.
Numerical Processing
Computers are recognized foremost for their computational or numerical-processing capabilities. In
bioinformatics, applications for numerical-processing techniques range from sequence analysis,
microarray data analysis, and site prediction to gene finding, protein structure prediction, and
phylogenetic analysis. These applications in turn rely on methods ranging from pattern matching,
simulation, and data mining to machine learning, statistics, cluster analysis, and decision trees. For
example, consider the pattern-matching challenge associated with multiple string alignment—aligning
multiple polypeptide sequences—as a means of discovering potential homologous relationships
between proteins. Because millions of calculations may be involved in examining even three or four
relatively short sequences, the far more formidable task of aligning multiple sequences several
hundred amino acids in length is usually computationally prohibitive on even the fastest desktop
hardware.
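To make the arithmetic concrete, the sketch below scores a single pairwise global alignment with the Needleman-Wunsch dynamic program. The scoring values are illustrative assumptions, not a standard substitution matrix such as BLOSUM62:

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # assumed toy scoring scheme

def needleman_wunsch(a: str, b: str) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Initialize the first row and column with cumulative gap penalties.
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + GAP
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + GAP
    # Fill the table: each cell considers a match/mismatch and two gaps.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            score[i][j] = max(diag, score[i - 1][j] + GAP, score[i][j - 1] + GAP)
    return score[rows - 1][cols - 1]

print(needleman_wunsch("MVHLTPEEKS", "MVHLSPEEKT"))  # 6
```

Even this two-sequence case fills a table of len(a) × len(b) cells; the exact dynamic program for k sequences fills a k-dimensional table whose size is the product of all k lengths, which is why simultaneous alignment of many long sequences overwhelms desktop machines.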
In numerical-processing applications such as pattern matching, speed of computation is valued above
all else. As every computer hardware manufacturer knows, speed sells. A PC or workstation that was