Database Reference
In-Depth Information
Surprisingly, in spite of the ubiquity of this representation, with the
exception of [Shatkay (1995)], there has been little attempt to understand
and compare the algorithms that produce it. Indeed, there does not even
appear to be a consensus on what to call such an algorithm. For clarity, we
will refer to these types of algorithm, which input a time series and return
a piecewise linear representation, as segmentation algorithms.
The segmentation problem can be framed in several ways.
Given a time series
T
, produce the best representation using only K
segments.
Given a time series
, produce the best representation such that the maxi-
mum error for any segment does not exceed some user-specified threshold,
max error.
T
Given a time series
, produce the best representation such that the
combined error of all segments is less than some user-specified threshold,
total max error.
T
As we shall see in later sections, not all algorithms can support all these
specifications.
Segmentation algorithms can also be classified as batch or online. This is
an important distinction because many data mining problems are inherently
dynamic [Vullings et al. (1997), Koski et al. (1995)].
Data mining researchers, who needed to produce a piecewise linear
approximation, have typically either independently rediscovered an algo-
rithm or used an approach suggested in related literature. For example,
from the fields of cartography or computer graphics [Douglas and Peucker
(1973), Heckbert and Garland (1997), Ramer (1972)].
In this chapter, we review the three major segmentation approaches
in the literature and provide an extensive empirical evaluation on a very
heterogeneous collection of datasets from finance, medicine, manufacturing
and science. The major result of these experiments is that only online algo-
rithm in the literature produces very poor approximations of the data, and
that the only algorithm that consistently produces high quality results and
scales linearly in the size of the data is a batch algorithm. These results
motivated us to introduce a new online algorithm that scales linearly in the
size of the data set, is online, and produces high quality approximations.
The rest of the chapter is organized as follows. In Section 2, we provide
an extensive review of the algorithms in the literature. We explain the basic
approaches, and the various modifications and extensions by data miners. In
Section 3, we provide a detailed empirical comparison of all the algorithms.
 
Search WWH ::




Custom Search