Database Reference
In-Depth Information
A property of metrics is:
|d
(
p, r
)
− d
(
r, q
)
|≤d
(
p, q
)
,
∀p, q, r ∈ S,
which can be utilized to reduce the number of string distance computations.
The approach proposed in [19] partitions the input set
S
into subsets
S u (used),
S a keeps track of those
strings that have not been fully evaluated yet; initially
S e (eliminated), and
S a (alive). The set
S a
S
=
.Alower
bound
g
(
p
) is computed for each string
p
in
S a , i.e., the consensus error of
p
satisfies:
)=
q∈S
E S (
p
d
(
p, q
)
≥ g
(
p
)
.
values are potentially better candidates for
set median. For this reason the string
Clearly, strings with small
g
p
with the smallest
g
(
p
) value among
all strings in
S a
is transferred from
S a
to
S u . Then, the consensus error
E S (
p
p
) is computed and, if necessary, the current best median candidate
g
is updated. Then, the lower bound
is computed for all strings that are
alive, and those whose
g
is not smaller than
E S (
p ) are moved from
S a
to
S e . They will not be considered as median candidates any longer. This
process is repeated until
S a becomes empty.
In each iteration, the consensus error for
p
with the smallest
g
value is
computed by:
)=
q∈S u
)+
q∈S e ( S a −{p} )
E S (
p
d
(
p, q
d
(
p, q
)
.
Using (1) the term
d
(
p, q
) in the second summation is estimated by:
d
(
p, q
)
≥|d
(
p, r
)
− d
(
r, q
)
|,
∀r ∈ S u .
Taking all strings of
S u into account, we obtain the lower bound:
)+
q∈S e ( S a −{p} )
E S (
p
)
d
(
p, q
max
rεS u |d
(
p, r
)
− d
(
r, q
)
|
=
g
(
p
)
.
q∈S u
The critical point here is to see that all the distances in this lower
bound are concerned with
p
and strings from
S u , and were therefore already
computed before. When strings in
S u ), their
consensus errors need not to be computed in future. This fact results in sav-
ing of distance computations. In addition to (2), two other lower bounds
within the same algorithmic framework are given in [19]. They differ in the
resulting ratio of the number of distance computations and the remain-
ing overhead, with the lower bound (2) requiring the smallest amount of
distance computations.
S a
are eliminated (moved to
Search WWH ::




Custom Search