Database Reference
In-Depth Information
A large number of operations and algorithms have been proposed to
deal with strings [1,5,13,34,36]. Some of them are inherent to the special
nature of strings such as the shortest common superstring and the longest
common substring, while others are adapted from other domains.
In data mining, clustering and machine learning, a typical task is to
represent a set of (similar) objects by means of a single prototype. Interest-
ing applications of the median concept have been demonstrated in dealing
with 2D shapes [16, 33], binary feature maps [23], 3D rotation [9], geomet-
ric features (points, lines, or 3D frames) [32], brain models [12], anatomical
structures [37], and facial images [31]. In this paper we discuss the adapta-
tion of the median concept to the domain of strings.
The median concept is useful in various contexts. It represents a funda-
mental quantity in statistics. In sensor fusion, multisensory measurements
of some quantity are averaged to produce the best estimate. Averaging the
results of several classifiers is used in multiple classifier systems in order to
achieve more reliable classifications.
The outline of the chapter is as follows. We first formally introduce the
median string problem in Section 2 and provide some related theoretical
results in Section 3. Sections 4 and 5 are devoted to algorithmic procedures
for eciently computing set median and generalized median strings. In
Section 6 we report experimental results to demonstrate the median concept
and to compare some of the considered algorithms. Finally, some discussions
conclude this paper.
2. Median String Problem
x
Assuming an alphabet Σ of symbols, a string
is simply a sequence of
symbols from Σ, i.e.
x
=
x 1 x 2 ···x n
where
x i
Σfor
i
=1
,...,n
.Given
the space
U
of all strings over Σ, we need a distance function
d
(
p, q
)to
measure the dissimilarity between two strings
p, q ∈ U
.Let
S
be a set of
N
strings from
U
. The essential information of
S
is captured by a string
p ∈ U
¯
that minimizes the sum of distances of ¯
p
to all strings from
S
,also
called the consensus error
E S (
p
):
)=
q∈S
p
¯
=arg min
p∈U E S (
p
)
,
where
E S (
p
d
(
p, q
)
.
The string ¯
p
is called a generalized median of
S
. If the search is con-
strained to the given set
S
, the resultant string
p
ˆ
=arg min
p∈S E S (
p
)
Search WWH ::




Custom Search