Database Reference
In-Depth Information
in all dimensions, usually satisfied through the somewhat stricter require-
ment of the triangle inequality (
d
x
z
≤ d
x, y
d
y
z
)).
One important problem that occurs when trying to index sequences with
spatial access methods is the so-called
dimensionality curse
: Spatial indices
typically work only when the number of dimensions is low [Chakrabarti
and Mehrotra (1999)]. This makes it unfeasible to code the entire sequence
directly as a vector in an indexed space.
The general solution to this problem is
dimensionality reduction
:to
condense the original sequences into
signatures
in a
signature space
of low
dimensionality, in a manner which, to some extent, preserves the distances
between them. One can then index the signature space.
(
,
)
(
)+
(
,
3. Signature Based Similarity Search
x
n
A time sequence
of length
can be considered a vector or point in an
n
-dimensional space. Techniques exist (spatial access methods, such as the
R
-tree and variants [Chakrabarti and Mehrotra (1999), Wang and Perng
(2001), Sellis
et al.
(1987)] for indexing such data. The problem is that
the performance of such methods degrades considerably even for relatively
low dimensionalities [Chakrabarti and Mehrotra (1999)]; the number of
dimensions that can be handled is usually several orders of magnitude lower
than the number of data points in a typical time sequence.
A general solution described by Faloutsos
et al.
(1994; 1997) is to extract
a low-dimensional
signature
from each sequence, and to index the signature
space. This
shrink
and
search
approach is illustrated in Figure 2.
Fig. 2.
The signature based approach.