Concepts of Similarity in Bioinformatics - Essays in Bioinformatics

Information Technology Reference

In-Depth Information

between two objects is computable by an unequivocal algorithm. A distance of two stars in

space is a good example. For instance, such measures are computed between unstructured

descriptions like vectors. One can define simple proximity measures also for structured

descriptions, provided the equivalences of the entities (residues, atoms) are a priori defined.

The Hamming distance and rmsd are such measures for character strings and 3D structures

respectively. They are based on straightforward algorithms for calculating a distance

between the two objects. Such distances are often calculated between fragments of larger

structures hence they are sometimes called fragment distances. B) Substructure proximity

measures are computed between parts structured descriptions. They require a simple

measure as well as an algorithm to select the “optimal substructure” in the two objects. For

instance, the distance of two galaxies can be defined as the distance between their closest

stars. In this case, we have to measure the distance between all possible pairs (simple

proximity), and then select the smallest one. The two central problems of bioinformatics, -

sequence alignment and 3D structural alignment - are substructure similarity problems. So

instead of two objects, we need compare “galaxies of substructures” which is a compute

intensive task. The complexity of the calculation is different (sequence alignment has a

complexity O(n,m) while structural alignment is np -complete), but the the basic concepts of

substructure selection matching- described in the subsequent paragraph 3.3 - is common

to both. The present section will concentrate mostly on simple proximity measures. We will

follow the classification of Sneath and Sokal [25] who distinguished four classes proximity

measures: distance coefficients, association coefficients, correlation coefficients and

probabilistic coefficients.

2.3 Distance measures

Vector distance measures are perhaps the simplest class of similarity measures owing to

their geometric interpretation. The most common is probably the Euclidean distance,

which, for some pair of objects A and B, described by n-dimensional vectors A i and B i ,

respectively, is defined as:

[1]

This is a distance defined in the n-dimensional space. A simple variant of this

formula is the average distance, which is simply the Euclidean distance divided by the

number of dimensions, i.e., by n in this case. The generalization of the Euclidean distance

leads to a class of metric distance functions called the Minkowski metrics, defined by the

following general formula:

[2]

(

)

r to the Euclidean distance.

The so-called sup distance is the Minkowski metric of r = f and corresponds to

corresponds to the “city block” distance and

[3]

(

)

max

Essays in Bioinformatics

Search WWH ::

Custom Search

Home