law [45-51]. Zipf's law states that the frequency of a word is related to its rank
by a power law [52, 53]. While there is some debate as to the meaning of this
observation for biological sequences [45-51], the Zipf plot of n-gram frequencies
has found application in identification of genome signatures [16].
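A Zipf plot of n-gram frequencies starts from a rank-frequency table. The following is a minimal sketch, assuming protein sequences are given as plain strings of one-letter amino acid codes; the function names ngram_counts and zipf_table are illustrative and are not part of any toolkit mentioned in the text.

```python
from collections import Counter

def ngram_counts(sequences, n):
    """Count overlapping n-grams across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def zipf_table(counts):
    """Return (rank, ngram, frequency) rows, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, gram, freq)
            for rank, (gram, freq) in enumerate(ranked, start=1)]
```

Plotting log(frequency) against log(rank) for the rows of this table gives the Zipf plot; under a power law the points fall approximately on a straight line.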
The Zipf-like analysis of protein sequences makes it possible to ask
whether the sequences in proteins of different organisms are statistically similar
or whether organisms may be viewed as representations of different languages. We
compared the n-gram frequencies of 44 different organisms using the n-gram
comparison functions provided by the Biological Language Modeling Toolkit.
(1) A simple Markovian uni-gram (context-independent amino acid) model was
trained on the proteins of Aeropyrum pernix. When training and test sets were
from the same organism, a perplexity (a variation of cross-entropy) of 16.6 was
observed, whereas test data from other organisms gave perplexities from 16.8 to 21.9. Thus, even
the simplest model can automatically detect the differences in amino acid usage
of different organisms. (2) We developed a modification of Zipf-like analysis that
can reveal specific differences in n-grams in different organisms. First, the amino
acid n-grams of a given length were sorted in descending order by frequency for
the organism of choice. An example is shown in Fig. 12 for Neisseria meningitidis
for n=4. Remarkably, there are three n-grams (shown by red arrows in the figure)
that are among the top 20 most frequently occurring 4-grams in Neisseria, but
that are rare or absent in any of the other genomes.
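The perplexity comparison in (1) can be illustrated with a short sketch. It assumes a 20-letter amino acid alphabet and add-alpha smoothing; both the smoothing and the function names are assumptions made for illustration, not the toolkit's actual implementation. Perplexity is computed as 2 raised to the cross-entropy in bits per residue.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_unigram(sequences, alpha=1.0):
    """Estimate amino acid probabilities with add-alpha smoothing (assumed)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values()) + alpha * len(AMINO_ACIDS)
    return {aa: (counts[aa] + alpha) / total for aa in AMINO_ACIDS}

def perplexity(model, sequences):
    """Return 2 ** (cross-entropy in bits per residue) of the test data."""
    log_sum = 0.0
    n_residues = 0
    for seq in sequences:
        for ch in seq:
            if ch in model:  # residues outside the alphabet are skipped
                log_sum += math.log2(model[ch])
                n_residues += 1
    if n_residues == 0:
        raise ValueError("no scorable residues in test data")
    return 2 ** (-log_sum / n_residues)
```

A model that assigns equal probability to all 20 amino acids has a perplexity of 20 on any test set. Training on one organism and scoring held-out proteins from another organism would be expected to give a higher perplexity when their amino acid usage differs.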
These highly idiosyncratic n-grams suggest “phrases” that are preferentially
used in the particular organism. These phrases are highly statistically significant,
not only across organisms, but also within Neisseria itself. In particular,
the 4-grams SDGI and MPSE are highly over-represented as compared to the
frequencies expected based on the uni-gram distributions in Neisseria [16]. (3)
While it is not known if these “phrases” correspond to similar or different sub-
structures of proteins, we found that amino acid neighbor preferences are also
different for different organisms, suggesting the possibility for underlying subtle
changes in the mapping of sequences to structures of proteins.
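Over-representation relative to the uni-gram distribution can be sketched as the ratio of the observed n-gram count to the count expected if residues were drawn independently. This is a minimal illustration under that independence assumption; the function name overrepresentation is hypothetical, and no significance test is computed here.

```python
from collections import Counter

def overrepresentation(sequences, n=4):
    """Map each n-gram to observed count / expected count.

    The expected count assumes residues are independent and follow
    the uni-gram frequencies of the same sequences.
    """
    unigram = Counter()
    ngram = Counter()
    for seq in sequences:
        unigram.update(seq)
        for i in range(len(seq) - n + 1):
            ngram[seq[i:i + n]] += 1
    total_residues = sum(unigram.values())
    total_ngrams = sum(ngram.values())
    ratios = {}
    for gram, observed in ngram.items():
        prob = 1.0
        for aa in gram:
            prob *= unigram[aa] / total_residues
        ratios[gram] = observed / (prob * total_ngrams)
    return ratios
```

A ratio well above 1 marks an n-gram that occurs more often than its amino acid composition alone would predict.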
4.2 Protein Family Classification
Another important task in computational biology is protein family classification.
G-protein coupled receptors (GPCRs) are a superfamily of proteins that is
particularly difficult to classify into families due to the extreme diversity among
its members. A comparison of BLAST, k-NN, HMM and SVM with alignment-
based features has suggested that classifiers at the complexity of SVM are needed
to attain high accuracy [54]. However, we were able to show that the simple Decision
Tree and Naïve Bayes classifiers in conjunction with chi-square feature
selection on counts of n-grams perform extremely well, and the Naïve Bayes
classifier even outperforms the SVM significantly [55]. We also generalized the
utility of n-grams for high-accuracy classification of other protein families using
the Naïve Bayes approach [55, 56]. In line with these observations, Wu and co-workers
have observed that neural networks perform well with n-gram features
in the protein family classification task [57].
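The combination of chi-square feature selection with a Naïve Bayes classifier on n-gram counts can be sketched in a few functions. The details below are assumptions made for illustration and are not taken from [55]: the chi-square statistic is computed on n-gram presence versus family label, the classifier is a multinomial Naïve Bayes with add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(seq, n_values=(1, 2)):
    """Count overlapping n-grams of the requested lengths in one sequence."""
    counts = Counter()
    for n in n_values:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def chi_square_scores(docs, labels):
    """Chi-square of n-gram presence vs. class over the contingency table."""
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_sizes = Counter(labels)
    present = {}  # n-gram -> Counter(class -> sequences containing it)
    for doc, y in zip(docs, labels):
        for gram in doc:
            present.setdefault(gram, Counter())[y] += 1
    scores = {}
    for gram, by_class in present.items():
        n_present = sum(by_class.values())
        score = 0.0
        for c in classes:
            cells = ((by_class[c], n_present),
                     (class_sizes[c] - by_class[c], n_docs - n_present))
            for observed, row_total in cells:
                expected = row_total * class_sizes[c] / n_docs
                if expected > 0:
                    score += (observed - expected) ** 2 / expected
        scores[gram] = score
    return scores

def train_naive_bayes(docs, labels, features):
    """Multinomial Naive Bayes with add-one smoothing on selected features."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        totals = Counter()
        for doc, y in zip(docs, labels):
            if y == c:
                for gram in features:
                    totals[gram] += doc[gram]
        denom = sum(totals.values()) + len(features)
        loglik[c] = {g: math.log((totals[g] + 1) / denom) for g in features}
    return prior, loglik

def predict(model, doc):
    """Return the class with the highest posterior log-score."""
    prior, loglik = model
    def score(c):
        return prior[c] + sum(doc[g] * lp for g, lp in loglik[c].items())
    return max(prior, key=score)
```

In use, each training sequence is converted with ngrams, the highest-scoring n-grams from chi_square_scores are kept as features, and train_naive_bayes is fitted on those features only. Restricting the model to discriminative n-grams keeps it small while retaining the signal that separates families.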