Information Technology Reference
In-Depth Information
between two objects is computable by an unequivocal algorithm. A distance of two stars in
space is a good example. For instance, such measures are computed between unstructured
descriptions like vectors. One can define simple proximity measures also for structured
descriptions, provided the equivalences of the entities (residues, atoms) are a priori defined.
The Hamming distance and rmsd are such measures for character strings and 3D structures
respectively. They are based on straightforward algorithms for calculating a distance
between the two objects. Such distances are often calculated between fragments of larger
structures hence they are sometimes called fragment distances. B) Substructure proximity
measures are computed between parts structured descriptions. They require a simple
measure as well as an algorithm to select the “optimal substructure” in the two objects. For
instance, the distance of two galaxies can be defined as the distance between their closest
stars. In this case, we have to measure the distance between all possible pairs (simple
proximity), and then select the smallest one. The two central problems of bioinformatics, -
sequence alignment and 3D structural alignment - are substructure similarity problems. So
instead of two objects, we need compare “galaxies of substructures” which is a compute
intensive task. The complexity of the calculation is different (sequence alignment has a
complexity O(n,m) while structural alignment is np -complete), but the the basic concepts of
substructure selection matching- described in the subsequent paragraph 3.3 - is common
to both. The present section will concentrate mostly on simple proximity measures. We will
follow the classification of Sneath and Sokal [25] who distinguished four classes proximity
measures: distance coefficients, association coefficients, correlation coefficients and
probabilistic coefficients.
2.3 Distance measures
Vector distance measures are perhaps the simplest class of similarity measures owing to
their geometric interpretation. The most common is probably the Euclidean distance,
which, for some pair of objects A and B, described by n-dimensional vectors A i and B i ,
respectively, is defined as:
1
§
n
·
2
¦
[1]
2
D
¨
©
A
B
¸
¹
E
i
i
i
1
This is a distance defined in the n-dimensional space. A simple variant of this
formula is the average distance, which is simply the Euclidean distance divided by the
number of dimensions, i.e., by n in this case. The generalization of the Euclidean distance
leads to a class of metric distance functions called the Minkowski metrics, defined by the
following general formula:
1
[2]
§
n
·
r
¦
r
D
(
r
)
¨
©
A
B
¸
¹
M
i
i
i
1
r to the Euclidean distance.
The so-called sup distance is the Minkowski metric of r = f and corresponds to
corresponds to the “city block” distance and
r
1
2
[3]
D
(
f
)
max
A
B
M
1
i
d
n
i
i
Search WWH ::




Custom Search