(3) Finally, an n-gram language modeling approach has also been adopted.
The method builds a language model for each 'topic' representing transmem-
brane helices and loops and compares their performance in predicting the cur-
rent amino acid, to determine whether a boundary occurs at the current position.
The language models make use of only n-gram probabilities, yet they still
produced promising results [14, 18, 19].
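The idea of comparing per-topic language models can be illustrated with a minimal sketch. It is not the method of [14, 18, 19]: the bigram order, the add-one smoothing, the toy training fragments, and the rule of labeling each position by the better-predicting model are all assumptions made here for illustration, and the names NGramModel, label_positions and boundaries are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class NGramModel:
    """Add-one smoothed n-gram model over amino acid sequences (illustrative)."""

    def __init__(self, n=2):
        self.n = n
        self.context_counts = defaultdict(int)
        self.ngram_counts = defaultdict(int)

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.n - 1, len(seq)):
                context = seq[i - self.n + 1:i]
                self.context_counts[context] += 1
                self.ngram_counts[context + seq[i]] += 1

    def log_prob(self, context, residue):
        # Add-one smoothing is an assumption, not taken from the source.
        num = self.ngram_counts.get(context + residue, 0) + 1
        den = self.context_counts.get(context, 0) + len(ALPHABET)
        return math.log(num / den)


def label_positions(seq, helix_model, loop_model):
    """Label each position 'H' or 'L' by whichever model predicts it better."""
    n = helix_model.n
    labels = []
    for i in range(n - 1, len(seq)):
        context = seq[i - n + 1:i]
        h = helix_model.log_prob(context, seq[i])
        l = loop_model.log_prob(context, seq[i])
        labels.append("H" if h > l else "L")
    return labels


def boundaries(labels, offset):
    """Return 0-based sequence positions where the winning model changes."""
    return [i + offset for i in range(1, len(labels))
            if labels[i] != labels[i - 1]]


# Toy training data (hypothetical fragments, not real annotations).
helix_model = NGramModel(n=2)
helix_model.train(["LLIVALLAVIL", "AVLLIIVLAL"])
loop_model = NGramModel(n=2)
loop_model.train(["GSPNDKRGSE", "KDNGSPERGS"])

query = "LLIVALGSPNDK"
labels = label_positions(query, helix_model, loop_model)
predicted = boundaries(labels, offset=helix_model.n - 1)
print("".join(labels), predicted)
```

On this toy input the helix model wins for the hydrophobic prefix and the loop model for the polar suffix, so a single boundary is reported where the two meet. A realistic system would train on annotated proteins and smooth the per-position decisions over a window.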
4 Other Applications of Language Technologies
in Computational Biology of Proteins
4.1 Genome Comparison
The secondary structure and transmembrane helix prediction tasks are only two
examples of many tasks in computational biology where language technologies
are relevant. For example, the features most often used in language technologies
are word n-grams and there are many other tasks where n-grams do form mean-
ingful building blocks. Probably the most widely known application of n-grams in
computational biology is their use in the BLAST algorithm, where they enhance
computational efficiency in sequence searching in the initial step [44]. However,
n-grams have also proven useful in a number of other bioinformatics areas. The
distributions of n-grams in biological sequences have been shown to follow Zipf's
Fig. 12. Distribution of amino acid n-grams with n=4 in Neisseria meningitidis in com-
parison to the distribution of the corresponding amino acids in 44 other organisms [61].
N-grams of Neisseria are plotted in descending order of their frequency in the genome
(in bold red). Numbers on the x-axis indicate the ranks of the specific n-grams in Neisseria.
Frequencies of corresponding n-grams from genomes of various other organisms are also
shown (thin lines). The second thin line closely following the bold red line corresponds
to a different strain of Neisseria meningitidis. Arrows indicate the positions of 4-grams
that are over-represented in Neisseria, but are rare in other genomes. The figure is
reproduced from [61] with permission from the publisher.
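The kind of comparison plotted in Fig. 12 can be sketched in a few lines: count the 4-grams of a reference proteome, rank them by frequency, and look up the same n-grams in other organisms. This is only an illustration under stated assumptions, not the procedure of [61]: the toy sequences, the function names, and the min_ratio and pseudo parameters used to flag over-represented n-grams are all hypothetical.

```python
from collections import Counter


def ngram_frequencies(proteins, n=4):
    """Relative frequencies of all overlapping n-grams in a list of sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def ranked_comparison(reference, others, top=10):
    """Rank reference n-grams by descending frequency and attach the
    frequency of the same n-gram in each of the other organisms."""
    ranked = sorted(reference.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    return [(rank, gram, freq, [o.get(gram, 0.0) for o in others])
            for rank, (gram, freq) in enumerate(ranked, start=1)]


def over_represented(reference, others, min_ratio=5.0, pseudo=1e-6):
    """N-grams at least min_ratio times more frequent in the reference than
    in any other organism (threshold and pseudocount are assumptions)."""
    hits = []
    for gram, freq in reference.items():
        max_other = max((o.get(gram, 0.0) for o in others), default=0.0)
        if freq / (max_other + pseudo) >= min_ratio:
            hits.append(gram)
    return sorted(hits)


# Toy sequences standing in for whole proteomes (hypothetical data).
ref = ngram_frequencies(["MKKLMKKLAAGG"])
others = [ngram_frequencies(["AAGGAAGG"]), ngram_frequencies(["MKTLAAGG"])]

table = ranked_comparison(ref, others, top=3)
hits = over_represented(ref, others)
for rank, gram, freq, other_freqs in table:
    print(rank, gram, round(freq, 3), [round(f, 3) for f in other_freqs])
print(hits)
```

N-grams returned by over_represented play the role of the arrows in the figure: frequent in the reference genome but rare elsewhere.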