Computational Biology and Language - Ambient Intelligence for Scientific Discovery

features do not encode the types of amino acid substitutions typical for protein
sequences, as exemplified in the motif example above. Therefore, it was investigated
whether a compact representation of position-specific n-grams as x{−|+}N, where x is
the n-gram, {−|+} indicates whether it occurs before or after the residue in
question, and N is the distance from this residue to the n-gram, may be a better
representation of the protein sequence. The analogy to language can be found in
classifying documents by topic. This task also requires identifying crucial words
that can discriminate between possible topics. For example, the word “ball” can
discriminate between “science” and “sports” topics but cannot distinguish between
“cricket” and “football” topics. Advances in topic detection methods for text
documents have resulted in some reliable methods to identify such discriminating
words. In the context of protein secondary structure prediction, there are also
position-specific propensities of amino acids in different secondary structure
types, and therefore topic detection algorithms are directly applicable to secondary
structure prediction at the residue level. The application of the context-sensitive
vocabulary provided results that are comparable to the current state-of-the-art
methods using “black-box” classification approaches, in particular neural networks,
with Q3 accuracy of about 70%. The advantage of the context-sensitive vocabulary
over these “black-box” methods is that it allows analysis of the word-association
matrix with singular value decomposition to identify co-occurring word pairs,
corresponding to regular expressions with a specific “meaning” for secondary
structure. For example, one of the most highly associated word pairs corresponded to
the pattern “CPxxAI”. This pattern describes the loop region at the C-terminal end
of a beta-sheet. Thus, the context-sensitive vocabulary encodes some of the complex
dependencies between amino acids that determine formation of secondary structure.
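
The position-specific n-gram vocabulary described above can be illustrated with a minimal sketch. The function name, the parameters n and max_distance, and the exact distance convention (counted from the residue to the nearest end of the n-gram) are assumptions made here for illustration; the text does not specify them, and this is not the implementation used in the work it describes.

```python
def position_specific_ngrams(sequence, index, n=2, max_distance=3):
    """Return context tokens of the form x-N or x+N for one residue.

    x is an n-gram, '-' or '+' says whether it lies before or after the
    residue at `index`, and N is the distance to the residue.
    Assumed convention (hypothetical): N is measured from the residue to
    the nearest residue of the n-gram.
    """
    tokens = []
    for distance in range(1, max_distance + 1):
        # n-gram whose last residue lies `distance` positions before `index`
        end = index - distance
        start = end - n + 1
        if start >= 0:
            tokens.append(f"{sequence[start:end + 1]}-{distance}")
        # n-gram whose first residue lies `distance` positions after `index`
        start = index + distance
        if start + n <= len(sequence):
            tokens.append(f"{sequence[start:start + n]}+{distance}")
    return tokens


if __name__ == "__main__":
    # Context "words" for residue E (index 3) of a toy sequence
    print(position_specific_ngrams("ACDEFGH", 3, n=2, max_distance=2))
    # prints ['CD-1', 'FG+1', 'AC-2', 'GH+2']
```

Each residue is thereby turned into a small "document" of tokens, which could then be passed to a word-based classifier in the same way that words are used for topic detection.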

3 Identification of Functional Building Blocks in Proteins as a Signal Processing Task

The lack of knowledge about where the break points separating words from each other
lie is not new to the language arena. In fact, it is found in many speech
applications. In a spoken sentence, words are not separated from each other by
spaces as in written text. Thus, automatic speech analysis and synthesis methods
also have to deal with identification of meaningful units. In speech recognition
applications, the task therefore shifts from statistical analysis of word
frequencies to a stronger focus on identifying the signal and differentiating it
from noise. Similarly, the task of mapping protein sequences to their structure,
dynamics and function can also be seen more generally as a signal processing task.
Just as the speech signal is a waveform whose acoustical features vary with time, a
protein is a linear chain of physicochemical features that vary with position in the
sequence. However, while a speech sample can take unlimited continuous values, or
digitized values within a given digital resolution, for proteins the value can be
only one of twenty possibilities, corresponding to the twenty types of amino acids
(see above). Hence assigning a symbol or value to each
