SEGMENTING TIME SERIES: A SURVEY AND NOVEL APPROACH - Data Mining in Time Series Databases

Database Reference

In-Depth Information

Surprisingly, in spite of the ubiquity of this representation, with the

exception of [Shatkay (1995)], there has been little attempt to understand

and compare the algorithms that produce it. Indeed, there does not even

appear to be a consensus on what to call such an algorithm. For clarity, we

will refer to these types of algorithm, which input a time series and return

a piecewise linear representation, as segmentation algorithms.

The segmentation problem can be framed in several ways.

•

Given a time series

T

, produce the best representation using only K

segments.

•

Given a time series

, produce the best representation such that the maxi-

mum error for any segment does not exceed some user-specified threshold,

max error.

T

•

Given a time series

, produce the best representation such that the

combined error of all segments is less than some user-specified threshold,

total max error.

T

As we shall see in later sections, not all algorithms can support all these

specifications.

Segmentation algorithms can also be classified as batch or online. This is

an important distinction because many data mining problems are inherently

dynamic [Vullings et al. (1997), Koski et al. (1995)].

Data mining researchers, who needed to produce a piecewise linear

approximation, have typically either independently rediscovered an algo-

rithm or used an approach suggested in related literature. For example,

from the fields of cartography or computer graphics [Douglas and Peucker

(1973), Heckbert and Garland (1997), Ramer (1972)].

In this chapter, we review the three major segmentation approaches

in the literature and provide an extensive empirical evaluation on a very

heterogeneous collection of datasets from finance, medicine, manufacturing

and science. The major result of these experiments is that only online algo-

rithm in the literature produces very poor approximations of the data, and

that the only algorithm that consistently produces high quality results and

scales linearly in the size of the data is a batch algorithm. These results

motivated us to introduce a new online algorithm that scales linearly in the

size of the data set, is online, and produces high quality approximations.

The rest of the chapter is organized as follows. In Section 2, we provide

an extensive review of the algorithms in the literature. We explain the basic

approaches, and the various modifications and extensions by data miners. In

Section 3, we provide a detailed empirical comparison of all the algorithms.

Data Mining in Time Series Databases

Search WWH ::

Custom Search

Home