Data Mining Trends and Research Frontiers - Data Mining: Concepts and Techniques


Similarity Search in Time-Series Data

A time-series data set consists of sequences of numeric values obtained from repeated measurements over time. The values are typically measured at equal time intervals (e.g., every minute, hour, or day). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, utility studies, inventory studies, yield projections, workload projections, and process and quality control. They are also useful for studying natural phenomena (e.g., atmosphere, temperature, wind, earthquake), scientific and engineering experiments, and medical treatments.

Unlike normal database queries, which find data that match a given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence. Many time-series similarity queries require subsequence matching, that is, finding a set of sequences that contain subsequences that are similar to a given query sequence.
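
The idea of subsequence matching can be illustrated with a minimal brute-force sketch. It assumes Euclidean distance as the similarity measure and a user-chosen threshold `epsilon`; neither is prescribed by the text, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def subsequence_matches(series, query, epsilon):
    """Return (offset, distance) for every window of `series` whose
    distance to `query` is at most `epsilon` (assumed threshold)."""
    m = len(query)
    matches = []
    # Slide a window of the query's length over every offset.
    for start in range(len(series) - m + 1):
        d = euclidean(series[start:start + m], query)
        if d <= epsilon:
            matches.append((start, d))
    return matches

series = [1.0, 2.0, 3.0, 2.0, 1.0, 2.1, 3.1, 1.9, 1.0]
query = [2.0, 3.0, 2.0]
hits = subsequence_matches(series, query, epsilon=0.5)
print(hits)
```

Here the window at offset 1 matches the query exactly, while the window at offset 5 differs only slightly and is still reported. The scan costs time proportional to the series length times the query length, which is one reason for the reduction and indexing techniques discussed next.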

For similarity search, it is often necessary to first perform data or dimensionality reduction and transformation of time-series data. Typical dimensionality reduction techniques include (1) the discrete Fourier transform (DFT), (2) discrete wavelet transforms (DWT), and (3) singular value decomposition (SVD) based on principal components analysis (PCA). Because we touched on these concepts in Chapter 3, and because a thorough explanation is beyond the scope of this topic, we will not go into great detail here. With such techniques, the data or signal is mapped to a signal in a transformed space. A small subset of the “strongest” transformed coefficients is saved as features. These features form a feature space, which is a projection of the transformed space.
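
As a rough illustration of keeping only the “strongest” coefficients, the sketch below computes a naive DFT in pure Python and retains the `k` coefficients of largest magnitude. Interpreting “strongest” as largest magnitude is an assumption, as are the function names; a practical system would use an FFT library rather than this quadratic-time transform.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform (unnormalized)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal))
        for k in range(n)
    ]

def strongest_coefficients(signal, k):
    """Return {index: coefficient} for the k largest-magnitude
    DFT coefficients (assumed meaning of "strongest")."""
    coeffs = dft(signal)
    order = sorted(range(len(coeffs)),
                   key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: coeffs[i] for i in sorted(order[:k])}

# A pure sinusoid with two cycles over 16 samples.
n = 16
signal = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
features = strongest_coefficients(signal, k=2)
print(sorted(features))
```

For this sinusoid, all of the signal's energy sits in the coefficient at frequency index 2 and its mirror at index 14, so two features describe a 16-point sequence.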

Indices can be constructed on the original or transformed time-series data to speed up a search. For a query-based similarity search, techniques include normalization transformation, atomic matching (i.e., finding pairs of gap-free windows of a small length that are similar), window stitching (i.e., stitching similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches), and subsequence ordering (i.e., linearly ordering the subsequence matches to determine whether enough similar pieces exist). Numerous software packages exist for a similarity search in time-series data.
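
The first two of these steps can be sketched as follows. The sketch assumes z-normalization (zero mean, unit standard deviation) as the normalization transformation and Euclidean distance with a threshold `epsilon` for atomic matching; the text does not fix either choice, and the function names are hypothetical. Window stitching and subsequence ordering would then operate on the returned pairs.

```python
import math

def z_normalize(window):
    """Shift to zero mean and scale to unit standard deviation
    (one assumed form of normalization transformation)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)  # constant window: no shape to compare
    return [(x - mean) / std for x in window]

def atomic_matches(s, t, w, epsilon):
    """Return (i, j) pairs where the gap-free windows s[i:i+w] and
    t[j:j+w] lie within `epsilon` of each other after normalization."""
    s_windows = [z_normalize(s[i:i + w]) for i in range(len(s) - w + 1)]
    t_windows = [z_normalize(t[j:j + w]) for j in range(len(t) - w + 1)]
    pairs = []
    for i, a in enumerate(s_windows):
        for j, b in enumerate(t_windows):
            if math.dist(a, b) <= epsilon:
                pairs.append((i, j))
    return pairs

s = [1.0, 2.0, 3.0, 1.0]
t = [10.0, 20.0, 30.0, 50.0]
pairs = atomic_matches(s, t, w=3, epsilon=1e-9)
print(pairs)
```

The windows `[1, 2, 3]` and `[10, 20, 30]` differ in offset and amplitude but share the same shape, so normalization makes them match. This is the reason normalization is applied before windows are compared.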

Recently, researchers have proposed transforming time-series data into piecewise aggregate approximations so that the data can be viewed as a sequence of symbolic representations. The problem of similarity search is then transformed into one of matching subsequences in symbolic sequence data. We can identify motifs (i.e., frequently occurring sequential patterns) and build index or hashing mechanisms for an efficient search based on such motifs. Experiments show this approach is fast and simple, and has comparable search quality to that of DFT, DWT, and other dimensionality reduction methods.
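
A minimal sketch of this transformation is shown below. It assumes the series is already normalized, that its length is divisible by the number of segments, and that a three-letter alphabet is used with breakpoints at roughly ±0.43 (the points that split a standard normal distribution into three equal-probability regions). These choices and the function names are illustrative assumptions, not details given in the text.

```python
from bisect import bisect_right

def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each
    equal-length segment. Assumes len(series) % segments == 0."""
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(segments)]

def symbolize(values, breakpoints, alphabet):
    """Map each value to the letter of the interval it falls in.
    `breakpoints` must be sorted, with len(alphabet) - 1 entries."""
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in values)

series = [-1.0, -0.8, -0.1, 0.1, 0.9, 1.1, 0.2, -0.2]
means = paa(series, segments=4)
word = symbolize(means, breakpoints=[-0.43, 0.43], alphabet="abc")
print(word)
```

The eight-point series is reduced to the four-letter word `abcb`. Once series are encoded this way, ordinary string techniques such as substring search or hashing of fixed-length words can be applied to find recurring motifs.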

Regression and Trend Analysis in Time-Series Data

Regression analysis of time-series data has been studied substantially in the fields of statistics and signal analysis. However, one may often need to go beyond pure regression
