of the twenty amino acids is one alternative for digital representation. This is,
however, not a meaningful representation for signal processing. We will review
below the approaches by which signal processing techniques become applicable
to the identification of meaningful building blocks in protein sequences. Using
this example, we will demonstrate in detail how scientific and technological
advances in automatic speech recognition become relevant to protein secondary
structure prediction, a specialized area within computational biology. Both
areas have been researched extensively for several decades, yet neither has
been completely solved. In both cases the underlying principles are understood
but are difficult to model in a form that a computer can decode in practice,
“as the physics of simplicity and complexity meet”
[26]. For a deeper understanding of protein structure and protein biochemistry,
see [27] and [28]. Readers interested further in speech recognition may refer to
[29, 30].
3.1 Digital Representation
A speech waveform is a superposition of signals of different frequencies.
By the Nyquist criterion, the information in a band-limited signal can be
captured completely by sampling it at a rate at least twice the highest
frequency in the signal. Since most information in human speech is band-limited
to about 8 kHz, sampling it at a rate of 16 kHz is sufficient. A typical digitized
speech signal is a series of discrete-time samples of its amplitude. The amplitude
of each sample is further coded into discrete levels to allow digital representation.
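The two steps just described, sampling at the Nyquist rate and coding each amplitude into discrete levels, can be sketched as follows. This is a minimal illustration rather than a description of any particular speech front end: the function names, the 1 kHz test tone, and the 8-bit uniform quantizer are all assumptions chosen for the example.

```python
import math


def min_sampling_rate(max_freq_hz):
    """Nyquist criterion: sample at least twice the highest frequency."""
    return 2 * max_freq_hz


def sample_tone(freq_hz, rate_hz, n_samples):
    """Discrete-time samples of a unit-amplitude sine tone (illustrative signal)."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]


def quantize(samples, bits):
    """Uniformly code amplitudes in [-1, 1] into integer levels 0 .. 2**bits - 1."""
    levels = 2 ** bits
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))                # clip to the coder's input range
        code = int((s + 1.0) / 2.0 * levels)      # map [-1, 1] onto [0, levels]
        codes.append(min(levels - 1, code))       # keep the top edge in range
    return codes


# Speech band-limited to about 8 kHz needs a rate of at least 16 kHz.
rate = min_sampling_rate(8000)

# Assumed example input: a 1 kHz tone, 16 samples, coded with 8 bits per sample.
samples = sample_tone(1000, rate, 16)
codes = quantize(samples, 8)
```

Here `codes` plays the role of the digitized speech signal: one integer level per sampling instant.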
To apply signal processing techniques to protein sequences, the protein must be
represented by a numerical value of some property at each position.
To derive a meaningful representation of the protein signal, we must understand
the chemical structures of the amino acids and their resulting physico-chemical
properties (see above). The scales relating the 20 amino acids to each other based
on these properties can be used to replace the amino acid symbols with numeric
representations more similar to speech waveforms. In principle, any one of the
property scales can be used, depending on the type of protein sequence analysis
required. Consider the example speech utterance, “how to recognize speech with
this new display”, whose waveform is shown in Fig. 6A. The signal has been
sampled at 16 kHz. Typically, the signal also contains background noise and
therefore the pauses in between words are not entirely flat. The waveform shows
how the amplitude of the sound varies as time progresses from the beginning
of the utterance to the end. In contrast, consider a protein. Figures 6B and 6C
show how a protein may be represented as numerical signals. Figure 6B shows
the protein in terms of charge, and Fig. 6C shows the protein represented in
terms of the hydrophobicity of its amino acids. While speech is represented with
respect to time, a protein is represented along a physical dimension, from one
end of the amino acid chain to the other.
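Replacing amino acid symbols with values from a property scale can be sketched as below. The text does not say which scales underlie Fig. 6, so the choices here are assumptions: the hydrophobicity values are the widely used Kyte-Doolittle hydropathy index, the charge scale is a simplified integer assignment at roughly neutral pH (histidine treated as neutral), and the peptide `MKTLLDE` is a made-up sequence used only for illustration.

```python
# Kyte-Doolittle hydropathy index (assumed scale; any property scale could be used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge near neutral pH (assumption: His counted as 0).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def to_signal(sequence, scale, default=0.0):
    """Map each residue to its scale value, giving one number per chain position."""
    return [float(scale.get(residue, default)) for residue in sequence.upper()]


# Hypothetical peptide, not taken from the text.
peptide = "MKTLLDE"
hydrophobicity_signal = to_signal(peptide, KYTE_DOOLITTLE)
charge_signal = to_signal(peptide, CHARGE)

# Print the position index (the protein analogue of time) next to both signals.
for position, residue in enumerate(peptide):
    print(position, residue, hydrophobicity_signal[position], charge_signal[position])
```

The list index takes the place of the time axis in the speech waveform: position 0 is one end of the chain and the last index is the other.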
The goal of speech recognition is to identify the words that are spoken. There
are several hundred thousand words in a typical language. These words are
formed by a combination of smaller units of sound called phones. Recognizing