Database Reference
In-Depth Information
A property of metrics is:
|d
(
p, r
)
− d
(
r, q
)
|≤d
(
p, q
)
,
∀p, q, r ∈ S,
which can be utilized to reduce the number of string distance computations.
The approach proposed in [19] partitions the input set
S
into subsets
S
u
(used),
S
a
keeps track of those
strings that have not been fully evaluated yet; initially
S
e
(eliminated), and
S
a
(alive). The set
S
a
S
=
.Alower
bound
g
(
p
) is computed for each string
p
in
S
a
, i.e., the consensus error of
p
satisfies:
)=
q∈S
E
S
(
p
d
(
p, q
)
≥ g
(
p
)
.
values are potentially better candidates for
set median. For this reason the string
Clearly, strings with small
g
p
with the smallest
g
(
p
) value among
all strings in
S
a
is transferred from
S
a
to
S
u
. Then, the consensus error
E
S
(
p
p
∗
) is computed and, if necessary, the current best median candidate
g
is updated. Then, the lower bound
is computed for all strings that are
alive, and those whose
g
is not smaller than
E
S
(
p
∗
) are moved from
S
a
to
S
e
. They will not be considered as median candidates any longer. This
process is repeated until
S
a
becomes empty.
In each iteration, the consensus error for
p
with the smallest
g
value is
computed by:
)=
q∈S
u
)+
q∈S
e
∪
(
S
a
−{p}
)
E
S
(
p
d
(
p, q
d
(
p, q
)
.
Using (1) the term
d
(
p, q
) in the second summation is estimated by:
d
(
p, q
)
≥|d
(
p, r
)
− d
(
r, q
)
|,
∀r ∈ S
u
.
Taking all strings of
S
u
into account, we obtain the lower bound:
)+
q∈S
e
∪
(
S
a
−{p}
)
E
S
(
p
)
≥
d
(
p, q
max
rεS
u
|d
(
p, r
)
− d
(
r, q
)
|
=
g
(
p
)
.
q∈S
u
The critical point here is to see that all the distances in this lower
bound are concerned with
p
and strings from
S
u
, and were therefore already
computed before. When strings in
S
u
), their
consensus errors need not to be computed in future. This fact results in sav-
ing of distance computations. In addition to (2), two other lower bounds
within the same algorithmic framework are given in [19]. They differ in the
resulting ratio of the number of distance computations and the remain-
ing overhead, with the lower bound (2) requiring the smallest amount of
distance computations.
S
a
are eliminated (moved to