Database Reference
In-Depth Information
Ideally, the distance function is desired to be a metric, in order to match
the human intuition of similarity. The triangle inequality excludes the case
in which
) is very large. In prac-
tice, however, there may exist distance functions which do not satisfy the
triangle inequality. To judge the suitability of these distance functions, the
work [6] suggests quasi-metrics with a relaxed triangle inequality. Instead
of the strict triangle inequality, the relation:
d
(
p, r
)and
d
(
r, q
) are both small, but
d
(
p, q
≥
d
(
p, q
)
d
(
p, r
)+
d
(
r, q
)
1+
ε
is required now. Here
is not
very large, the relaxed triangle inequality still retains the human intuition
of similarity. Note that the strict triangle inequality is a special case with
ε
ε
is a small nonnegative constant. As long as
ε
= 0. The fast set median search procedure [19] sketched above easily
extends to quasi-metrics. In this case the relationship (1) is replaced by:
max
d
)
(
p, r
)
,
d
(
q, r
)
d
(
p, q
)
≥
− d
(
q, r
)
− d
(
p, r
,
∀p, q, r ∈ S
1+
ε
1+
ε
which can be used in the same manner to establish a lower bound
g
(
p
).
4.2.
Approximate Set Median Search in Arbitrary Spaces
Another approach to fast set median search makes no assumption on the
distance function and covers therefore non-metrics as well. The idea of this
approximate algorithm is simple. Instead of computing the sum of distances
of each string to all the other strings of
S
to select the best one, only a
subset of
is used to obtain an estimation of the consensus error [30]. The
algorithm first calculates such estimations and then calculates the exact
consensus errors only for strings that have low estimations.
This approximate algorithm proceeds in two steps. First, a random sub-
S
set
S
r
of
N
r
strings is selected from
S
.Foreachstring
p
of
S
, the consensus
error
E
S
r
(
p
) relative to
S
r
is computed and serves as an estimation of the
consensus error
N
t
strings with the lowest con-
sensus error estimations are chosen. The exact consensus error
E
S
(
p
). In the second step
E
S
(
p
)is
computed for the
N
t
strings and the string with the minimum
E
S
(
p
)is
regarded the (approximate) set median string of
S
.
5. Computation of Generalized Median Strings
While the set median problem is characterized by selecting one particular
member out of a given set of strings, the computation of generalized median