A SURVEY OF RECENT METHODS FOR EFFICIENT RETRIEVAL OF SIMILAR TIME SEQUENCES - Data Mining in Time Series Databases

Database Reference

In-Depth Information

points by a compression factor typically in the range from 10 to 600 for

real data [Keogh (1997)], outperforming methods based on the Discrete

Fourier Transform by one to three orders of magnitude [Keogh and Pazzani

(1999b)]. This approximation is shown to be valid under several distance

measures, including dynamic time warping distance [Keogh and Pazzani

(1999b)]. An enhanced representation is introduced in [Keogh and Pazzani

(1998)], where every line segment in the approximation is augmented with

a weight representing its relative importance; for instance, a combined

sequence may be constructed representing a class of sequences, and some

line segments may be more representative of the class than others.

4.3. Search Space Pruning through Subsequence Hashing

Keogh and Pazzani (1999a) describe an indexing method based on hashing,

in addition to the piecewise linear approximation. An equi-spaced template

grid window is moved across the sequence, and for each position a hash key

is generated to decide into which bin the corresponding subsequence is put.

The hash key is simply a binary string, where 1 means that the sequence

is predominantly increasing in the corresponding part of the template grid,

while 0 means that it is decreasing. These bin keys may then be used during

a search, to prune away entire bins without examining their contents. To get

more benefit from the bin pruning, the bins are arranged in a best-first order.

5. Conclusion

This chapter has sought to give an overview of recent advances in the field of

similarity based retrieval in time sequence databases. First, the problem of

similarity search and the desired properties of robust distance measures and

good indexing methods were outlined. Then, the general approach of signa-

ture based similarity search was described. Following the general descrip-

tion, three specific signature extraction approaches were discussed: Spectral

signatures, based on Fourier components (or wavelet components); piece-

wise constant approximation, and the related method adaptive piecewise

constant approximation; and landmark methods, based on the extraction of

significant points in a sequence. Finally, some methods that are not based

on signature extraction were mentioned.

Although the field of time sequence indexing has received much atten-

tion and is now a relatively mature field [Keogh et al . (2002)] there are still

areas where further research might be warranted. Two such areas are (1)

thorough empirical comparisons and (2) applications in data mining.

Search WWH ::

Custom Search

Home