Database Reference
In-Depth Information
2. The Problem
The problem of retrieving similar time sequences may be stated as follows:
Given a sequence
q
, a set of time sequences
X
, a (non-negative) distance
measure
d
,anda
tolerance threshold
ε
, find the set
R
of sequences closer
to
q
than
ε
, or, more precisely:
R
{x ∈ X|d
q, x
≤ ε}.
=
(
)
(3)
Alternatively, one might wish to find the
k
nearest neighbours of
q
,which
amounts to setting
ε
so that
|R|
=
k
. The parameter
ε
is typically supplied
by the user, while the distance function
is domain-dependent. Several
distance measures will be described rather informally in this chapter. For
more formal definitions, see Appendix.
Figure 1 illustrates the problem for Euclidean distance in two dimen-
sions. In this example, the vector
d
x
will be included in the result set
R
,
while
will not.
A useful variation of the problem is to find a set of
subsequences
of the
sequences in
X
. This, in the basic case, requires comparing
y
q
not only to
, but to all possible subsequences.
2
If a method retrieves a subset
all elements of
X
S
of
R
, the wrongly dismissed sequences
in
R − S
are called
false dismissals
.Conversely,if
S
is a superset of
R
,the
sequences in
S − R
are called
false alarms.
Fig. 1.
Similarity retrieval.
2
Except in the description of LCS in Appendix,
subsequence
means
contiguous subse-
quence
,or
segment.