Computational Biology and Language - Ambient Intelligence for Scientific Discovery

of the twenty amino acids is one alternative for digital representation. This is,
however, not a meaningful representation for signal processing. Below, we review
the approaches by which signal processing techniques become applicable to the
identification of meaningful building blocks in protein sequences. Using this
example, we demonstrate in detail how scientific and technological advances in
automatic speech recognition become relevant to protein secondary structure
prediction, a specialized area within computational biology. Both areas have been
researched extensively for several decades, and in neither has a complete solution
been achieved; in both cases the underlying principles are understood, yet they are
difficult to model in a form that a computer can decode in practice, “as the
physics of simplicity and complexity meet” [26]. For a deeper understanding of
protein structure and protein biochemistry, see [27] and [28]. Readers interested
in further detail on speech recognition may refer to [29, 30].

3.1 Digital Representation

A speech waveform is a superposition of signals of different frequencies. By the
Nyquist criterion, the information in the signal can be completely captured by
sampling it at a rate that is at least twice the highest frequency in the signal.
Since most information in human speech is band-limited to about 8 kHz, sampling
at 16 kHz is sufficient. A typical digitized speech signal is a series of
discrete-time samples of its amplitude. The amplitude of each sample is further
coded into discrete levels to allow digital representation.
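The two digitization steps described above, sampling in time and coding amplitudes into discrete levels, can be sketched as follows. This is a minimal illustration rather than the procedure used for the recordings in this chapter: the pure sine tone, the 16-bit depth, and the function name `sample_and_quantize` are assumptions made for the example; only the 16 kHz sampling rate and the twice-the-highest-frequency rule come from the text.

```python
import math


def sample_and_quantize(freq_hz, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a unit-amplitude sine tone and code each sample as a signed integer level.

    The tone and the 16-bit depth are illustrative assumptions; the text only
    states that speech sampled at 16 kHz is coded into discrete levels.
    """
    # Nyquist criterion as stated in the text: the sampling rate must be
    # at least twice the highest frequency present in the signal.
    if sample_rate_hz < 2 * freq_hz:
        raise ValueError("sampling rate is below twice the signal frequency")

    max_level = 2 ** (bits - 1) - 1          # largest positive level, e.g. 32767
    n_samples = int(round(duration_s * sample_rate_hz))

    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                # time of the n-th sample in seconds
        amplitude = math.sin(2 * math.pi * freq_hz * t)
        samples.append(int(round(amplitude * max_level)))  # quantization step
    return samples


# 10 ms of a 1 kHz tone at 16 kHz yields 160 discrete-time samples.
tone = sample_and_quantize(freq_hz=1000, duration_s=0.01)
print(len(tone), min(tone), max(tone))
```

Requesting a 9 kHz tone at the default 16 kHz rate raises a `ValueError`, which mirrors the reason an 8 kHz band limit calls for a 16 kHz sampling rate.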

To apply signal processing techniques to protein sequences, the protein must be
represented numerically by the value of some property at each position. To derive
a meaningful representation of the protein signal, we must understand the
chemical structures of the amino acids and their resulting physico-chemical
properties (see above). The scales that relate the 20 amino acids to each other
based on these properties can be used to replace the amino acid symbols with
numeric representations that more closely resemble speech waveforms. In
principle, any one of the property scales can be used, depending on the type of
protein sequence analysis required. Consider the example speech utterance, “how
to recognize speech with this new display”, whose waveform is shown in Fig. 6A.
The signal has been sampled at 16 kHz. Typically, the signal also contains
background noise, and therefore the pauses between words are not entirely flat.
The waveform shows how the amplitude of the sound varies as time progresses from
the beginning of the utterance to the end. In contrast, consider a protein.
Figures 6B and 6C show how a protein may be represented as numerical signals:
Fig. 6B shows the protein in terms of charge, and Fig. 6C shows it in terms of
the hydrophobicity of the amino acids. While speech is represented with respect
to time, a protein is represented along a physical dimension, from one end of
the amino acid chain to the other.
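Replacing amino acid symbols with values from a property scale can be sketched as below. The text does not say which scales were used for Figs. 6B and 6C, so two assumptions are made here: hydrophobicity uses the Kyte-Doolittle hydropathy values, and charge uses a simplified neutral-pH assignment (+1 for K and R, -1 for D and E, 0 otherwise, with histidine treated as neutral). The sequence `MKTLLVDE` and the function names are hypothetical and chosen only for illustration.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the text does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge at neutral pH (assumption: histidine is neutral).
CHARGE = {aa: 0 for aa in KYTE_DOOLITTLE}
CHARGE.update({"K": 1, "R": 1, "D": -1, "E": -1})


def encode(sequence, scale):
    """Map a one-letter amino acid sequence to a numeric signal, one value per position."""
    try:
        return [scale[residue] for residue in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unknown amino acid symbol: {err.args[0]!r}") from None


def hydrophobicity_signal(sequence):
    return encode(sequence, KYTE_DOOLITTLE)


def charge_signal(sequence):
    return encode(sequence, CHARGE)


# Hypothetical eight-residue sequence, used only to show the two signals.
example = "MKTLLVDE"
print(hydrophobicity_signal(example))
print(charge_signal(example))
```

The index of each value is the residue position along the chain, which plays the role that the sample time plays in the speech waveform. Swapping in a different scale only requires passing another dictionary to `encode`.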

The goal of speech recognition is to identify the words that are spoken. There
are several hundred thousand words in a typical language. These words are formed
by a combination of smaller units of sound called phones. Recognizing
