Computational Biology and Language - Ambient Intelligence for Scientific Discovery

law [45-51]. Zipf's law states that the frequency of a word is related to its rank

by a power law [52, 53]. While there is some debate as to the meaning of this

observation for biological sequences [45-51], the Zipf plot of n-gram frequencies

has found application in identification of genome signatures [16].
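
As a minimal sketch of what a Zipf plot of n-gram frequencies involves, the snippet below counts overlapping n-grams, ranks them by frequency, and fits a straight line to log(frequency) against log(rank). A slope near -1 would correspond to the classical form of Zipf's law. The function names and the toy sequences are invented for illustration and are not taken from the toolkit or from the cited work.

```python
import math
from collections import Counter

def ngram_counts(sequences, n):
    """Count overlapping n-grams across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def zipf_table(counts):
    """Return (rank, ngram, frequency) rows, most frequent first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, gram, freq) for rank, (gram, freq) in enumerate(ranked, start=1)]

def zipf_slope(table):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two rows, otherwise the fit is undefined.
    """
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(freq) for _, _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy sequences, invented for illustration only.
toy = ["MKVLAAGIVALLA", "MKVLSDGIAAGLL", "MPSEAAGIVKVLA"]
table = zipf_table(ngram_counts(toy, 2))
print(table[:5])
print("log-log slope:", zipf_slope(table))
```

On real proteomes the table would be built from all protein sequences of one organism; the toy input is far too small for the slope to be meaningful.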

The Zipf-like analysis of protein sequences allows addressing the question of

whether the sequences in proteins of different organisms are statistically similar

or if organisms may be viewed as representations of different languages. We

compared the n-gram frequencies of 44 different organisms using the n-gram

comparison functions provided by the Biological Language Modeling Toolkit.

(1) A simple Markovian uni-gram (context-independent amino acid) model was trained on

the proteins of Aeropyrum pernix. When training and test sets were

from the same organism, a perplexity (the exponentiated cross-entropy) of 16.6 was

observed, whereas perplexities on data from other organisms ranged from 16.8 to 21.9. Thus, even

the simplest model can automatically detect the differences in amino acid usage

of different organisms. (2) We developed a modification of Zipf-like analysis that

can reveal specific differences in n-grams in different organisms. First, the amino

acid n-grams of a given length were sorted in descending order by frequency for

the organism of choice. An example is shown in Fig. 12 for Neisseria meningitidis

for n=4. Remarkably, there are three n-grams (shown by red arrows in the figure)

that are among the top 20 most frequently occurring 4-grams in Neisseria, but

that are rare or absent in the other genomes.
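
The uni-gram perplexity comparison in step (1) can be sketched as follows. This is only an illustration under stated assumptions: the names `train_unigram` and `perplexity`, the add-one pseudocount, and the toy sequences are ours, and the way the Biological Language Modeling Toolkit computes perplexity may differ. Perplexity is taken here as 2 raised to the per-residue cross-entropy in bits, so a uniform model over the 20 amino acids gives a perplexity of 20.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_unigram(sequences, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Estimate context-independent residue probabilities.

    The pseudocount (an assumption of this sketch) keeps every
    probability above zero, so unseen residues do not break the log.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

def perplexity(model, sequences):
    """Return 2 ** H, where H is the cross-entropy in bits per residue."""
    log_sum = 0.0
    n = 0
    for seq in sequences:
        for ch in seq:
            if ch in model:  # residues outside the alphabet are skipped
                log_sum += math.log2(model[ch])
                n += 1
    if n == 0:
        raise ValueError("no scorable residues in the test sequences")
    return 2 ** (-log_sum / n)

# Toy data, invented for illustration only.
organism_a = ["MKKLLAAALKKAL", "MAKKLLKALAKKA"]
organism_b = ["MSDGISDGWWFYC", "MPSEWWCCFYSDG"]
model_a = train_unigram(organism_a)
print("same organism :", perplexity(model_a, organism_a))
print("other organism:", perplexity(model_a, organism_b))
```

A model trained on one organism should score its own sequences with lower perplexity than sequences whose amino acid usage differs, which is the effect described above.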

These highly idiosyncratic n-grams suggest “phrases” that are preferentially

used in the particular organism. These phrases are highly statistically significant,

not only across organisms, but also within Neisseria itself. In particular,

the 4-grams SDGI and MPSE are highly over-represented as compared to the

frequencies expected based on the uni-gram distributions in Neisseria [16]. (3)

While it is not known if these “phrases” correspond to similar or different

substructures of proteins, we found that amino acid neighbor preferences are also

different for different organisms, suggesting the possibility for underlying subtle

changes in the mapping of sequences to structures of proteins.
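
One plausible reading of "over-represented as compared to the frequencies expected based on the uni-gram distributions" is to compare each observed n-gram count with the count expected if residues occurred independently. The sketch below does exactly that and nothing more: it is not the procedure of [16], it performs no significance test, and the function names and toy sequences are invented for illustration.

```python
from collections import Counter

def observed_vs_expected(sequences, n):
    """Map each n-gram to (observed count, expected count).

    The expected count assumes residues are drawn independently from
    the uni-gram distribution of the same sequences.
    """
    residues = Counter()
    grams = Counter()
    for seq in sequences:
        residues.update(seq)
        for i in range(len(seq) - n + 1):
            grams[seq[i:i + n]] += 1
    total_residues = sum(residues.values())
    total_grams = sum(grams.values())
    table = {}
    for gram, observed in grams.items():
        prob = 1.0
        for ch in gram:
            prob *= residues[ch] / total_residues
        table[gram] = (observed, prob * total_grams)
    return table

def most_overrepresented(sequences, n, top=5):
    """Rank n-grams by the ratio of observed to expected count."""
    table = observed_vs_expected(sequences, n)
    ratios = {g: obs / exp for g, (obs, exp) in table.items()}
    return sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# Toy sequences, invented for illustration only.
toy = ["MKSDGIALSDGIKV", "MPSEAVSDGILLMPSE", "AKMPSEVLASDGI"]
for gram, ratio in most_overrepresented(toy, 4):
    print(gram, round(ratio, 1))
```

With very small inputs almost every n-gram looks over-represented, so ratios of this kind only become informative on whole-proteome counts.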

4.2 Protein Family Classification

Another important task in computational biology is protein family classification.

G-protein coupled receptors (GPCRs) are a superfamily of proteins that is

particularly difficult to classify into families due to the extreme diversity among

its members. A comparison of BLAST, k-NN, HMM and SVM with alignment-

based features has suggested that classifiers at the complexity of SVM are needed

to attain high accuracy [54]. However, we were able to show that the simple

Decision Tree and Naïve Bayes classifiers in conjunction with chi-square feature

selection on counts of n-grams perform extremely well, and the Naïve Bayes

classifier even outperforms the SVM significantly [55]. We also generalized the

utility of n-grams for high-accuracy classification of other protein families using

the Naïve Bayes approach [55, 56]. In line with these observations, Wu and

co-workers have observed that neural networks perform well with n-gram features

in the protein family classification task [57].
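
To make the combination of chi-square feature selection and a Naïve Bayes classifier on n-gram counts concrete, here is a small self-contained sketch. Several details are assumptions of ours and are not confirmed by the text or by [55]: the chi-square statistic is computed on presence or absence of an n-gram per sequence, the classifier is a multinomial Naïve Bayes with add-one smoothing, and the sequences and family labels "A" and "B" are invented toy data.

```python
import math
from collections import Counter, defaultdict

def ngrams(seq, n):
    """List the overlapping n-grams of one sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def chi_square_scores(docs, labels):
    """Chi-square of n-gram presence/absence against the class label."""
    n_docs = len(docs)
    class_sizes = Counter(labels)
    present = defaultdict(Counter)  # n-gram -> class -> sequences containing it
    for grams, label in zip(docs, labels):
        for g in set(grams):
            present[g][label] += 1
    scores = {}
    for g, by_class in present.items():
        n_present = sum(by_class.values())
        n_absent = n_docs - n_present
        chi2 = 0.0
        for c, size in class_sizes.items():
            cells = ((by_class[c], n_present), (size - by_class[c], n_absent))
            for observed, row_total in cells:
                expected = row_total * size / n_docs
                if expected > 0:
                    chi2 += (observed - expected) ** 2 / expected
        scores[g] = chi2
    return scores

def train_naive_bayes(docs, labels, features, alpha=1.0):
    """Multinomial Naive Bayes over the selected n-grams (add-alpha smoothing)."""
    class_sizes = Counter(labels)
    counts = defaultdict(Counter)
    for grams, label in zip(docs, labels):
        counts[label].update(g for g in grams if g in features)
    model = {}
    for c, size in class_sizes.items():
        total = sum(counts[c].values()) + alpha * len(features)
        log_lik = {f: math.log((counts[c][f] + alpha) / total) for f in features}
        model[c] = (math.log(size / len(labels)), log_lik)
    return model

def predict(model, grams):
    """Return the class with the highest log posterior score."""
    def score(c):
        log_prior, log_lik = model[c]
        return log_prior + sum(log_lik[g] for g in grams if g in log_lik)
    return max(sorted(model), key=score)

# Toy training set, invented for illustration only.
train = [("SDGISDGIKL", "A"), ("KLSDGIAASD", "A"),
         ("MPSEMPSEVV", "B"), ("VVMPSEGGMP", "B")]
n = 3
docs = [ngrams(seq, n) for seq, _ in train]
labels = [label for _, label in train]
scores = chi_square_scores(docs, labels)
features = set(sorted(scores, key=lambda g: (-scores[g], g))[:10])
model = train_naive_bayes(docs, labels, features)
print(predict(model, ngrams("AASDGIKK", n)))
```

Keeping only the highest-scoring n-grams shrinks the feature space before the classifier is trained, which is the role feature selection plays in the approach described above.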
