Information Technology Reference
In-Depth Information
with another strand. Two or more such interacting strands form what is called
a sheet. A turn is defined as a short segment that causes the protein to bend.
A loop or coil region has no specific secondary structure. Commonly, the 7 groups
are reduced to 3 groups: helix, strand and loop (shown in an example in Fig. 5).
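The reduction to three groups can be illustrated with a short sketch. It assumes the detailed groups are written as DSSP-style one-letter codes, which the text does not state. The mapping table is one common convention, not necessarily the one used in [17]; other studies assign the 3-10 helix, pi helix or isolated bridge differently. The names DSSP_TO_3, reduce_to_three and segments are hypothetical.

```python
from itertools import groupby

# Assumed mapping from DSSP-style codes to three classes (H = helix,
# E = strand, C = loop). This is an illustrative convention only.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10 and pi helices -> helix
    "E": "E", "B": "E",            # extended strand, isolated bridge -> strand
    "T": "C", "S": "C",            # turn and bend -> loop
}

def reduce_to_three(codes):
    """Map each code to H, E or C; blank or unknown codes become loop."""
    return "".join(DSSP_TO_3.get(code, "C") for code in codes)

def segments(states):
    """Split a 3-state string into (state, start, end) runs, end exclusive."""
    runs, start = [], 0
    for state, group in groupby(states):
        length = len(list(group))
        runs.append((state, start, start + length))
        start += length
    return runs

three_state = reduce_to_three("HHGGEEBTS ")
print(three_state)
print(segments(three_state))
```

Each run returned by segments corresponds to one helix, strand or loop segment of the kind analyzed in the study described next.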
To study the relevance of different vocabularies for secondary structure forma-
tion, we used the following vocabularies: (1) chemical building blocks of amino
acids, (2) single amino acids from the 20 amino acid alphabet and (3) reduced
alphabets based on similarities between physico-chemical properties of amino
acids [17]. Latent Semantic Analysis (LSA) was used to decipher the role of the
vocabulary for this task, because it is a natural language processing method
that is used to extract hidden relations between words [22]. We should there-
fore be able to study the effects of different vocabularies on secondary structure
without introducing artifacts through the differences in size and geometry in
the different units studied. LSA captures semantic relations using global infor-
mation extracted from a large number of documents and can therefore identify
words in a text that are synonymous even when such information is not directly
available. LSA was then applied to characterize segments of protein sequences
with a given type of secondary structure: helix, strand or loop. Each segment
was represented as a bag-of-words vector traditionally used in document pro-
cessing. The word-document matrix comprising all the protein segment vectors
was transformed into Eigenspace through singular value decomposition, and the
protein segments were compared to each other in terms of vector representation
in singular space. To evaluate the usefulness of this representation, protein segments
were separated into training and test sets and the secondary structure of
each segment in the test set was predicted based on the secondary structure of
its nearest neighbors in the singular space from among the training set. When
representing the amino acid sequences using the three different vocabularies, we
observed that different vocabularies are better at characterizing different struc-
ture types. Helices and strands are best characterized using amino acid types
with LSA, and coils are characterized better with amino acids as vocabulary and
using the simple word-document matrix analysis (called VSM [23]) without LSA.
Average 3-class prediction (Q3) was found to be best using chemical groups as
vocabulary and using VSM. The results demonstrate that word-document matrix
analysis and LSA capture sequence preferences in structural types and can dis-
tinguish between the “meanings” of vocabularies for protein secondary structure
types. Furthermore, protein sequences represented in terms of chemical groups
and amino acid types provide more clues on structure than the classically used
amino acids as building blocks [17].
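The pipeline described above (bag-of-words vectors, singular value decomposition of the word-document matrix, nearest-neighbor prediction in singular space) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in [17]. The toy segments and labels are made up, the vocabulary is the 20-letter amino acid alphabet, similarity is cosine, and the names bag_of_words, lsa_project, knn_predict and q3, together with the rank and k values, are hypothetical choices. Accuracy is computed per segment here for simplicity.

```python
import numpy as np
from collections import Counter

VOCAB = "ACDEFGHIKLMNPQRSTVWY"  # single amino acids as the vocabulary
INDEX = {aa: i for i, aa in enumerate(VOCAB)}

def bag_of_words(segment):
    """Count how often each vocabulary word occurs in one segment."""
    v = np.zeros(len(VOCAB))
    for aa in segment:
        v[INDEX[aa]] += 1.0
    return v

def unit(rows):
    """L2-normalise rows so that dot products are cosine similarities."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    return rows / np.where(norms == 0.0, 1.0, norms)

def knn_predict(train_vecs, train_labels, test_vecs, k=1):
    """Majority vote among the k most cosine-similar training segments."""
    sims = unit(test_vecs) @ unit(train_vecs).T
    preds = []
    for row in sims:
        nearest = np.argsort(-row)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds

def lsa_project(train_counts, test_counts, rank):
    """Project segments onto the top singular vectors of the training matrix."""
    # Word-document matrix: rows are words, columns are training segments.
    U, _, _ = np.linalg.svd(train_counts.T, full_matrices=False)
    U_r = U[:, :rank]
    return train_counts @ U_r, test_counts @ U_r

def q3(true_labels, predicted_labels):
    """Fraction of segments whose class (H, E or C) is predicted correctly."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# Made-up toy data: (segment, class) pairs for illustration only.
train = [("AELLKKLAEE", "H"), ("EELLRKAEEL", "H"),
         ("VIVTVYVKVT", "E"), ("TVYVIVEVTV", "E"),
         ("GPSGNGPDGS", "C"), ("PGSGDNPSGP", "C")]
test = [("KLAEELLKKA", "H"), ("VTVKVYVIVT", "E"), ("GSPGDGSNGP", "C")]

X_train = np.array([bag_of_words(s) for s, _ in train])
X_test = np.array([bag_of_words(s) for s, _ in test])
y_train = [label for _, label in train]
y_test = [label for _, label in test]

# VSM: compare raw word-count vectors directly.
vsm_pred = knn_predict(X_train, y_train, X_test)
# LSA: compare the same segments after projection into singular space.
Z_train, Z_test = lsa_project(X_train, X_test, rank=3)
lsa_pred = knn_predict(Z_train, y_train, Z_test)

print("VSM Q3:", q3(y_test, vsm_pred))
print("LSA Q3:", q3(y_test, lsa_pred))
```

Swapping VOCAB for chemical groups or a reduced alphabet changes only bag_of_words, which is what makes the comparison between vocabularies straightforward in this setup.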
As shown by the above study [17] and many previous studies [24], single
amino acid propensities have limited ability to predict secondary structure elements.
It was therefore investigated whether larger segments composed of several amino
acids, so-called k-mers or n-grams of amino acids, are more appropriate units of
protein sequence language with respect to their meaning for secondary structure
[25]. However, this study found that n-grams do not capture secondary struc-
ture propensity of protein segments well. This is due to the fact that n-gram