features do not encode the types of amino acid substitutions typical for protein
sequences exemplified in the motif example above. Therefore, it was investigated
whether a compact representation of position-specific n-grams as x{−|+}N, where
x is the n-gram, {−|+} indicates whether it occurs before or after the residue
in question, and N is the distance from this residue to the n-gram, may be a
better representation of the protein sequence. The analogy to language can be
found when classifying documents into possible topics. This task also requires
identification of crucial words that can discriminate between possible topics. For
example, the word 'ball' can discriminate between “science” and “sports” topics
but cannot distinguish between “cricket” and “football” topics. Advances in topic
detection methods for text documents have resulted in some reliable methods to
identify such discriminating words. In the context of protein secondary structure
prediction, there are also position specific propensities of amino acids in different
secondary structure types and therefore topic detection algorithms are directly
applicable to secondary structure prediction at the residue level. The application
of the context-sensitive vocabulary provided results that are comparable to the
current state-of-the-art methods using “black-box” classification approaches, in
particular neural networks, with Q3 accuracy of about 70%. The advantage of
the use of the context-sensitive vocabulary over these “black-box” methods is
that it allows analysis of the word-association matrix with singular value decom-
position to identify co-occurring word pairs, corresponding to regular expressions
with a specific “meaning” for secondary structure. For example, one of the most
highly associated word pairs corresponded to the pattern “CPxxAI”. The pattern
describes the loop region at the C-terminal end of a beta-sheet. Thus, the
context-sensitive vocabulary encodes some of the complex dependencies between
amino acids that determine formation of secondary structure.
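
The x{−|+}N encoding described above can be illustrated with a minimal sketch. The function name context_words, the window parameter, and the convention that N is counted from the residue in question to the nearest residue of the n-gram are assumptions made for this illustration; the text does not specify them, and this is not the original implementation.

```python
def context_words(sequence, position, n=1, window=3):
    """Return position-specific n-gram tokens of the form x{-|+}N.

    Hypothetical helper: x is the n-gram, '-' or '+' marks whether it lies
    before or after the residue at `position`, and N is the distance from
    that residue to the nearest residue of the n-gram (an assumed convention).
    """
    words = []
    for distance in range(1, window + 1):
        # n-gram whose last residue lies `distance` positions before the target
        end = position - distance
        start = end - n + 1
        if start >= 0:
            words.append(f"{sequence[start:end + 1]}-{distance}")
        # n-gram whose first residue lies `distance` positions after the target
        start = position + distance
        if start + n <= len(sequence):
            words.append(f"{sequence[start:start + n]}+{distance}")
    return words


# Context vocabulary for residue 'E' (index 3) in a toy sequence
unigram_words = context_words("ACDEFGH", 3, n=1, window=2)
bigram_words = context_words("ACDEFGH", 3, n=2, window=2)
```

Each residue then becomes a small "document" of such tokens, which is the form a topic-detection style classifier would need as input. N-grams that would run past either end of the sequence are simply omitted in this sketch.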
3 Identification of Functional Building Blocks in Proteins as a Signal Processing Task
Not knowing where the break points that separate words from each other lie is
not new to the language arena. In fact, it arises in many speech applications.
In a spoken sentence, words are not separated from each other by
spaces as in written text. Thus, automatic speech analysis and synthesis meth-
ods also have to deal with identification of meaningful units. The task therefore
shifts from statistical analysis of word frequencies to a stronger focus on signal
identification and differentiation from noise in speech recognition applications.
Similarly, the task of mapping protein sequences to their structure, dynamics
and function can also be seen more generally as a signal processing task. Just
as the speech signal is a waveform whose acoustical features vary with time,
a protein is a linear chain of chemico-physical features that vary with position
in the sequence. However, while a speech sample can take unlimited contin-
uous values, or digitized values within a given digital resolution, for proteins
the value can be only one of the possible twenty, corresponding to the twenty
types of amino acids (see above). Hence assigning a symbol or value to each