Concepts of Similarity in Bioinformatics
Vilmos ÁGOSTON 1, László KAJÁN 2, Oliviero CARUGO 2,3, Zoltán HEGEDÜS 1, Kristian
VLAHOVICEK 2,4 and Sándor PONGOR 2
1 Bioinformatics Group, Biological Research Center, Hungarian Academy of Sciences,
Temesvári krt. 62, 6726 Szeged, Hungary
2 Protein Structure and Bioinformatics Group, International Centre for Genetic Engineering
and Biotechnology, Area Science Park, 34012 Trieste, Italy
3 Department of General Chemistry, Pavia University, viale Taramelli 12, 27100 Pavia,
Italy
4 Molecular Biology Department, Biology Division, Faculty of Science, University of
Zagreb, 10000 Zagreb, Croatia
Abstract. The key problem of bioinformatics is the prediction of properties, such as
structure or function, based on similarity. This chapter reviews the concepts and tools
of similarity analysis used in various fields of bioinformatics.
Introduction
The concept of similarity is fundamental in the study of macromolecular structures,
genomes, proteomes and metabolic pathways. Similar objects are often assumed to take
part in similar mechanisms, or to carry out a similar function. Similarity, on the other hand, is
a highly intuitive concept, and its use in various fields - such as the comparison of
sequences or of 3-D structures - is quite different. For students of molecular biology it is
sometimes difficult to find straightforward definitions of the basic concepts, which originate
from fields as diverse as cognitive psychology, systems science and various branches
of mathematics. The motivation of this review is to provide a - not necessarily complete -
compendium of useful concepts and definitions and to show the commonalities underlying
the various applications. We will use three main forms of representation: sequences, 3-D
structures and graphs. The discussion will be based on an entity-relationship description of
macromolecular structures [1], as applied to the description of small molecules [2] as well
as biological objects used in genome analysis [3].
Most concepts of molecular similarity have been proposed in applied contexts that
are so numerous that an exhaustive coverage would detract from our focus on the
underlying mathematical spaces. In particular, machine learning methodologies used in
bioinformatics [4, 5], such as neural networks [6] and support vector machines [7], are
based on specific concepts that in our view cannot be adequately described in the
framework of a general discussion. Similarly, we could not include a practice-oriented
overview of applications such as the comparison of sequences, 3-D structures and genomes
(a review on these topics will be published elsewhere [8]). Several fields that are gaining
importance in bioinformatics, such as the analysis of text similarities [9], could not be
incorporated because of space limitations. Although a significant amount of research is thus
excluded from this overview, a broad and, we hope to show, integrated body of research
remains.
The primary focus of this work is to present a set of useful definitions pertinent to
the similarity analysis of macromolecular structures, meant as reference material for