proaches, for example in document classification and topic detection. Tradition-
ally, these methods have relied on n-gram features and statistical associations
between them. We have complemented the above study with n-gram approaches
to address the membrane protein boundary prediction problem.
(1) By analogy with topic segmentation in natural language, we applied Yule's
measure of association [42] to this problem, following its use in natural language
processing [43]. Given a text with n different words, an n x n table of Yule
values for every pair of words is computed. The distribution of Yule values in
the table differs for different categories of text, indicating the positions of the
boundaries. In a model application to the G-Protein Coupled Receptor (GPCR)
family of membrane proteins, we found that Yule values can differentiate between
transmembrane helices and loops connecting the helices [18].
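As an illustration of the n x n table described above, the following is a minimal sketch only. It assumes that the association measure is Yule's Q, (ad - bc) / (ad + bc), computed on a 2 x 2 contingency table of adjacent symbol pairs; the text does not say which of Yule's coefficients or which co-occurrence window was used in [18], so both choices, along with the function name yule_q_table, are hypothetical.

```python
from collections import Counter
from itertools import product


def yule_q_table(sequence, alphabet=None):
    """Return {(x, y): Yule's Q} for every ordered symbol pair.

    Assumption: co-occurrence means "x immediately followed by y".
    """
    pairs = list(zip(sequence, sequence[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(p[0] for p in pairs)
    second_counts = Counter(p[1] for p in pairs)
    total = len(pairs)
    symbols = sorted(alphabet or set(sequence))

    table = {}
    for x, y in product(symbols, repeat=2):
        a = pair_counts[(x, y)]        # x followed by y
        b = first_counts[x] - a        # x followed by something other than y
        c = second_counts[y] - a       # something other than x followed by y
        d = total - a - b - c          # neither x first nor y second
        denom = a * d + b * c
        # Q is undefined when the denominator is zero; 0.0 is a placeholder.
        table[(x, y)] = (a * d - b * c) / denom if denom else 0.0
    return table


if __name__ == "__main__":
    q = yule_q_table("ABABAB")
    print(q[("A", "B")], q[("A", "A")])
```

Computing one such table per category of text (for example, helix segments versus loop segments of a training set) gives the distributions of values that can then be compared.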
(2) Using n-gram features but a different association measure, Mutual
Information, it was also shown that language technologies can discover known
functional building blocks, the transmembrane helices, without prior
assumptions about the length, type or properties of these building blocks.
While the Yule statistics above required prior knowledge in the form of a
training set of transmembrane versus non-transmembrane examples, Mutual
Information required no such knowledge. Computing Mutual Information
statistics on the entire dataset of a membrane protein family, the GPCR
family, without prior knowledge of the positions of the
extracellular-transmembrane and cytoplasmic-transmembrane boundaries,
rediscovers these boundaries, as shown in Fig. 11 [19]. In topic segmentation,
topic boundaries are indicated by minima in Mutual Information. Similarly, in
membrane protein sequences, both membrane-cytoplasmic and
membrane-extracellular boundaries are detected with high accuracy [19].
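The idea of reading boundaries off minima in a mutual information profile can be sketched as follows. This is not the procedure of [19], whose details are not given here. It assumes pointwise mutual information between adjacent symbols, estimated from an unlabeled set of sequences and smoothed with a sliding window; the names pmi_table, mi_profile and local_minima and the window parameter are all hypothetical.

```python
import math
from collections import Counter


def pmi_table(sequences):
    """Estimate pointwise mutual information for adjacent symbol pairs."""
    pair_counts, left_counts, right_counts = Counter(), Counter(), Counter()
    for seq in sequences:
        for x, y in zip(seq, seq[1:]):
            pair_counts[(x, y)] += 1
            left_counts[x] += 1
            right_counts[y] += 1
    total = sum(pair_counts.values())
    # log2( p(x, y) / (p(x) * p(y)) ) written in terms of raw counts
    return {
        (x, y): math.log2(n * total / (left_counts[x] * right_counts[y]))
        for (x, y), n in pair_counts.items()
    }


def mi_profile(seq, pmi, window=3, unseen=0.0):
    """Smoothed PMI at each junction i (between seq[i] and seq[i + 1]).

    Pairs absent from the table get the placeholder value `unseen`.
    """
    scores = [pmi.get(pair, unseen) for pair in zip(seq, seq[1:])]
    profile = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        profile.append(sum(scores[lo:hi]) / (hi - lo))
    return profile


def local_minima(profile):
    """Indices where the profile is strictly lower than both neighbours."""
    return [
        i for i in range(1, len(profile) - 1)
        if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]
    ]
```

In use, pmi_table would be estimated once from all sequences of a family, mi_profile evaluated along a single sequence, and the indices returned by local_minima treated as candidate boundary positions to compare against annotated ones.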
Fig. 11. Mutual information values along the rhodopsin sequence, using different
GPCR datasets to generate the mutual information values [19]. Horizontal lines use the
same color code as in Figure 1, indicating the positions of the segments belonging to
each of the extracellular, cytoplasmic and helical domains based on expert knowledge. The
positions of breakpoints indicated by mutual information minima are shown as blue
labels. The figure is JKS's version of the work. It is posted here by permission of ACM
for your personal use. Not for redistribution. The definitive version was published in
[19] http://doi.acm.org/10.1145/967900.967933.