INDEXING OF COMPRESSED TIME SERIES - Data Mining in Time Series Databases

Database Reference

In-Depth Information

Similarity measures. Several researchers have defined similarity as

the distance between points in a feature space. For example, Caraca-Valente

and Lopez-Chavarrias (2000) used Euclidean distance between feature vec-

tors containing angle of knee movement and muscle strength, and Lee et al.

(2000) applied Euclidean distance to compare feature vectors containing

color, texture, and shape of video data. This technique works well when all

features have the same units of scale [Goldin and Kanellakis (1995)], but it

is often ineffective for combining disparate features.

An alternative definition of similarity is based on bounding rectan-

gles; two series are similar if their bounding rectangles are similar. It

allows fast pruning of clearly dissimilar series [Perng et al. (2000), Lee

et al. (2000)], but it is less effective for selecting the most similar series.

The envelope-count technique is based on dividing a series into short seg-

ments, called envelopes, and defining a yes/no similarity for each envelope.

Two series are similar within an envelope if their point-by-point differences

are within a certain threshold. The overall similarity is measured by the

number of envelopes where the series are similar [Agrawal et al. (1996)].

This measure allows fast computation of similarity, and it can be adapted

for noisy and missing data [Das et al. (1997), Bollobas et al. (1997)].

Finally, we can measure a point-by-point similarity of two series and

then aggregate these measures, which often requires interpolation of missing

points. For example, Keogh and Pazzani (1998) used linear interpolation

with this technique, and Perng et al. (2000) applied cubic approximation.

Keogh and Pazzani (2000) also described a point-by-point similarity with

modified Euclidean distance, which does not require interpolation.

Indexing and retrieval. Researchers have studied a variety of tech-

niques for indexing of time series. For example, Deng (1998) applied kd -

trees to arrange series by their significant features, Chan and Fu (1999)

combined wavelet transforms with R-trees, and Bozkaya and her col-

leagues used vantage-point trees for indexing series by numeric features

[Bozkaya et al. (1997), Bozkaya and Ozsoyoglu (1999)]. Park et al. (2001)

indexed series by their local extrema and by properties of the segments

between consecutive extrema. Li et al. (1998) proposed a retrieval technique

based on a multi-level abstraction hierarchy of features. Aggarwal and Yu

(2000) considered grid structures, but found that the grid performance is

often no better than exhaustive search. They also showed that exhaustive

search among compressed series is often faster than sophisticated indexing

techniques.

Search WWH ::

Custom Search

Home