Information Technology Reference
In-Depth Information
Distance measures calculated between identical molecular descriptions (vectors) are
zero, and may grow without limit for non-identical vectors. It is sometimes desirable to
have bounded values, for example the so-called Canberra metric is defined as:
n
¦
A
B
[4]
i
i
C
n
i
1
M
¦
A
B
i
i
i
1
so it is zero for identical values but remains below unity for non-identical vectors. The
metric properties of vector distance measures are important for clustering and for
evolutionary studies. For M to be a metric (metric distance), the following criteria have to
be fulfilled for all A, B, C from X: i)
M
(
A
,
B
)
t
0
the equality holding if and only if
A ;
B
ii)
M t (triangular
inequality). Metric properties are essential if a distance measure is to be used for clustering.
A string similarity measures S (eqn. 5) is applicable to clustering if there is an associated
distance measure
M
(
A
,
B
)
M
(
B
,
A
)
(symmetry);
iii)
(
A
,
B
)
M
(
B
,
C
)
M
(
A
,
C
)
M that has metric properties. f is a monotonous function, and
distance measures such as 1-kS (where k is a constant) are routinely used in clustering
applications.
f
( S
)
2.4 String similarity measures for biological sequences
A special class of proximity measures, sequence similarity scores are used to quantify the
matching (alignment) of protein and DNA sequences. The underlying mathematical concept
is the string distance. Let us first concentrate on bit strings consisting of zero/one values.
( Figure 6A ). The Hamming distance is the number of (zero to one or one to zero) changes
necessary to change string 1 to string 2. This can be used immediately for short character
strings of identical length, with the condition that only exchanges are possible, gaps are not
allowed. If we keep this condition, we can use a simple lookup table to store the costs of
exchanging one character against another. The situation is the same if we use overlapping
doublet or triplet words, etc. i.e. the Hamming distance is a simple distance that can be
unequivocally computed based on lookup tables, because the matching of the two strings is
considered unique.
The situation is quite different if we match strings of arbitrary length and allow gaps
(Figure 7). The string edit distance is defined as the minimal number of steps (insertions,
deletions and replacements) necessary to transform one word into the other. The proximity
measures used for biological sequences are defined as similarity coefficients (high values
for similar, low for dissimilar sequences) and contain cost factors for residue substitutions
as well as gaps (insertions, deletions).
¦
¦
S
cos
t
cos
t
[5]
1
2
identities
,
tss
replacemen
gaps
Search WWH ::




Custom Search