5 Biological Language Modeling Toolkit and Website
A large number of linguistic methods for protein sequence analysis are provided
at http://flan.blm.cs.cmu.edu/.
6 Conclusions
Here, we have shown that an intuitive analogy allows methods developed in one
specialized area of research to be applied directly to another. In particular,
we demonstrated the use of language and speech technologies for a variety of
computational biology problems. We described the major hurdle in using this
analogy: identifying functional equivalents of “words” in protein sequences,
with the long-term goal of preparing a dictionary for the “protein sequence
language”. Although we are far from building such a dictionary, we demonstrated
that a number of different vocabularies can provide meaningful building blocks
in protein sequences. The utility of these vocabularies depends on the specific
application in computational biology, and we provided examples from secondary
structure prediction of soluble and membrane proteins, motif identification in
genomes, protein family classification, and protein folding and tertiary
structure. Vocabularies range from individual chemical groups, to single amino
acids, to combinations of amino acids (n-grams) with and without context
information, to chemical property representations. Speech recognition and topic
boundary detection methods, applied to the automatic identification of
functional building blocks, each independently identified secondary structure
elements as major functional building blocks of protein sequences.
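To make two of the vocabularies named above concrete, the following minimal
sketch extracts overlapping amino acid n-grams from a sequence and re-encodes
the sequence in a reduced chemical property alphabet. The three-class grouping
and the function names are assumptions chosen for illustration only; they are
not the vocabularies or the toolkit interface used in this work.

```python
from collections import Counter

# Hypothetical three-class grouping of the 20 standard amino acids,
# chosen for illustration only (not the grouping used by the authors).
PROPERTY_CLASS = {
    **dict.fromkeys("AVLIMFWC", "h"),  # hydrophobic
    **dict.fromkeys("STNQYGP", "p"),   # polar or special
    **dict.fromkeys("DEKRH", "c"),     # charged
}


def ngrams(sequence, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def to_property_alphabet(sequence):
    """Map each residue to its property class; unknown residues become 'x'."""
    return "".join(PROPERTY_CLASS.get(residue, "x") for residue in sequence)


def ngram_counts(sequence, n):
    """Count how often each n-gram occurs in a sequence."""
    return Counter(ngrams(sequence, n))
```

The same n-gram counting can be applied to the output of
to_property_alphabet, which gives n-grams over chemical properties rather than
over individual amino acids.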
Acknowledgements
Research presented here was funded in part by NSF ITR grants EIA0225656 and
EIA0225636, the Sofja Kovalevskaja Award from the Humboldt Foundation /
Zukunftsinvestitionsprogramm der Bundesregierung Deutschland, and NIH grant
NLM108730.