SEGMENTING TIME SERIES: A SURVEY AND NOVEL APPROACH - Data Mining in Time Series Databases

Database Reference

In-Depth Information

(i)

(vi)

(ii)

(vii)

(iii)

(viii)

(iv)

(ix)

(v)

(x)

Fig. 3. The 10 datasets used in the experiments. (i) Radio Waves. (ii) Exchange

Rates. (iii) Tickwise II. (iv) Tickwise I. (v) Water Level. (vi) Manufacturing. (vii) ECG.

(viii) Noisy Sine Cubed. (ix) Sine Cube. (x) Space Shuttle.

distance is just the square root of the sum of squares). Euclidean dis-

tance, or some measure derived from it, is by far the most common metric

used in data mining of time series [Agrawal et al. (1993), Agrawal et al.

(1995), Chan and Fu (1999), Das et al. (1998), Keogh et al. (2000), Keogh

and Pazzani (1999), Keogh and Pazzani (1998), Keogh and Smyth (1997),

Qu et al. (1998), Wang and Wang (2000)]. The linear interpolation ver-

sions of the algorithms, by definition, will always have a greater sum of

squares error.

We immediately encounter a problem when attempting to compare the

algorithms. We cannot compare them for a fixed number of segments, since

Sliding Windows does not allow one to specify the number of segments.

Instead we give each of the algorithms a fixed max error and measure the

total error of the entire piecewise approximation.

The performance of the algorithms depends on the value of max error .

As max error goes to zero all the algorithms have the same performance,

since they would produce

/2 segments with no error. At the opposite end,

as max error becomes very large, the algorithms once again will all have

the same performance, since they all simply approximate T with a single

best-fit line. Instead, we must test the relative performance for some rea-

sonable value of max error , a value that achieves a good trade off between

compression and fidelity. Because this “reasonable value” is subjective and

dependent on the data mining application and the data itself, we did the fol-

lowing. We chose what we considered a “reasonable value” of max error for

each dataset, and then we bracketed it with 6 values separated by powers of

two. The lowest of these values tends to produce an over-fragmented approx-

imation, and the highest tends to produce a very coarse approximation. So

in general, the performance in the mid-range of the 6 values should be

considered most important. Figure 4 illustrates this idea.

n

Data Mining in Time Series Databases

Search WWH ::

Custom Search

Home