Information Technology Reference
In-Depth Information
classes, each containing members of a protein family. It is assumed that a sequence
similarity measure is also defined on the set of sequences. Another similarity space used for
proteins consists of the structures or protein folds as descriptors, a set of equivalence
classes, each containing members of a specific fold group. A distance function, such as
RMSD, is defined on the set of fold structures.
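As an illustration of such a distance function, the following minimal sketch computes the RMSD between two structures given as lists of (x, y, z) coordinates. It assumes that the two structures have the same number of atoms, that atoms are paired by position in the list, and that the structures have already been superimposed; the names `rmsd`, `fold_a`, and `fold_b` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumes the points are paired by index and already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates (hypothetical): fold_b is fold_a shifted by one unit along z.
fold_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
fold_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

print(rmsd(fold_a, fold_a))
print(rmsd(fold_a, fold_b))
```

In practice an optimal superposition step usually precedes the RMSD calculation; that step is omitted here to keep the sketch short.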
The co-existence of a priori known (biologically relevant) classification schemes
and computable proximity measures is characteristic of the similarity spaces studied by
bioinformatics. In the typical case, the database also contains a large number of unclassified
objects (sequences, structures), and much effort goes into either establishing new classes for
some of these objects or fitting them into one of the existing categories. A proximity
measure can also be used to establish a computable classification using one of
the many clustering methods. In the fortunate case, the computed clustering is consistent with
the a priori known classification, and any new clusters that have no a priori
known counterparts are excellent candidates for discovering new, biologically relevant
classes.
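The interplay between a computed clustering and an a priori classification can be sketched as follows. This is only one possible choice among the many clustering methods: single-linkage clustering with a distance threshold, using the Hamming distance as a stand-in proximity measure. The sequences, the labels in `known_class`, and the threshold are hypothetical toy values.

```python
def single_linkage_clusters(items, distance, threshold):
    """Group items so that any two items within `threshold` of each other share a cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if distance(a, b) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return list(clusters.values())


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# A priori classification for some objects; the rest are unclassified.
known_class = {"AAAA": "family1", "AAAT": "family1", "GGGG": "family2"}
sequences = ["AAAA", "AAAT", "GGGG", "GGGC", "CTCT", "CTCA"]

clusters = single_linkage_clusters(sequences, hamming, threshold=1)


def labels_in(cluster):
    """Set of a priori class labels found among the members of a cluster."""
    return {known_class[s] for s in cluster if s in known_class}


# Consistent: no computed cluster mixes members of different known classes.
consistent = all(len(labels_in(c)) <= 1 for c in clusters)
# Clusters without any classified member are candidates for new classes.
novel_candidates = [c for c in clusters if not labels_in(c)]

print(clusters)
print(consistent, novel_candidates)
```

Here the cluster containing only unclassified sequences plays the role of a potential new class, while the unclassified sequence that joins a labelled cluster would be fitted into an existing category.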
Methods for representing a priori known categories can be grouped according to the
nature of the description used for the individual categories [40]. Classical summary
descriptions are consensus descriptions that are valid for all members of a category.
Probabilistic summary descriptions are valid only with some probability. Consensus
descriptions such as sequence patterns can be pictured as the description of a prototype in
the given class. In contrast to consensus descriptions, exemplar-based descriptions
represent the categories as a database consisting of the members of the category. All of
these methods have been used, for example, for protein domain sequences. Domain sequence
collections and domain annotations in protein sequence databases are exemplar-based
descriptions. Regular expressions are classical summary (consensus) descriptions that are
supposed to be valid for all members, and there is a variety of statistical (probabilistic)
descriptions [40].
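The three kinds of description can be contrasted on a toy example. The sketch below is illustrative only: the aligned four-residue "family" in `members`, the regular expression, and the simple column-frequency profile are all hypothetical, and the profile is merely one simple instance of a probabilistic description.

```python
import re
from collections import Counter

# Hypothetical, pre-aligned toy family (not real domain sequences).
members = ["CAGC", "CTGC", "CAGC", "CSGC"]

# Exemplar-based description: the category is simply the stored members.
exemplar_db = set(members)


def is_exemplar(seq):
    return seq in exemplar_db


# Classical summary (consensus) description: a pattern that every member matches.
consensus = re.compile(r"C[AST]GC")


def matches_consensus(seq):
    return consensus.fullmatch(seq) is not None


# Probabilistic summary description: residue frequencies per alignment column.
profile = [Counter(column) for column in zip(*members)]


def profile_probability(seq):
    """Product of per-column residue frequencies; 0.0 if a residue was never seen."""
    if len(seq) != len(profile):
        return 0.0
    probability = 1.0
    for counts, residue in zip(profile, seq):
        probability *= counts[residue] / len(members)
    return probability


for query in ["CAGC", "CTGC", "CGGC"]:
    print(query, is_exemplar(query), matches_consensus(query), profile_probability(query))
```

The consensus pattern accepts or rejects a sequence outright, whereas the profile grades the members themselves: a residue seen in half of the members contributes a factor of 0.5.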
The problem of classification is one of the fundamental exercises in such fields as
domain sequence identification or function prediction. Given a set of classes $A_i$ in a
database, the classification of a sequence $S$ is often based on minimal distance (or maximum
similarity). Oftentimes, the class $A_i$ of the closest object [$\min_{i,j} PM(S, A_i^j)$] is
automatically assigned to an unclassified object. In other cases, the closest class is
determined from the consensus representations of the classes, using
$\min_i PM(S, A_i)$, with each class represented by its consensus description.
The use of mathematical spaces in the analysis of chemical structures is reviewed in [2, 18].
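The two assignment rules can be sketched side by side. In the sketch below the proximity measure PM is taken to be the Hamming distance, purely as an assumption for illustration; the class names, member sequences, and consensus strings are hypothetical toy data.

```python
def hamming(a, b):
    """Stand-in proximity measure: number of mismatching positions."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical classes with their members, and one consensus string per class.
classes = {"A1": ["AAAA", "AAAT"], "A2": ["GGGG", "GGGC"]}
consensus = {"A1": "AAAA", "A2": "GGGG"}


def classify_by_nearest_member(query, classes, pm):
    """Assign the class of the closest object: minimum of pm over all classes and members."""
    best_distance, best_class = min(
        (pm(query, member), name)
        for name, members in classes.items()
        for member in members
    )
    return best_class


def classify_by_consensus(query, consensus, pm):
    """Assign the class whose consensus representation is closest to the query."""
    return min(consensus, key=lambda name: pm(query, consensus[name]))


for query in ["AAGT", "GGGT"]:
    print(query,
          classify_by_nearest_member(query, classes, hamming),
          classify_by_consensus(query, consensus, hamming))
```

The nearest-member rule compares the query with every object in the database, while the consensus rule needs only one comparison per class; the two need not agree when a class is heterogeneous.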
5. Conclusions
In summary, the description of structures as entity-relationship
networks provides a simple framework for describing the use of similarity in various fields.
A number of qualitative concepts, such as similarity groups (equivalence classes) and
patterns, as well as quantitative concepts, such as similarity measures, are present in all
fields. Mathematical spaces (“similarity spaces”) provide a way of describing databases as
well as the mathematical tools of analysis in a common framework. The definitions listed
in this review are applicable in other fields of bioinformatics not explicitly mentioned
here, such as the analysis of semantic similarities [9] or the analysis of networks [41].
An overview of practical applications will be published in a subsequent chapter in this
volume [8].