2.4. Feature Comparison of the Major Algorithms
We have deliberately deferred the discussion of the running times of the
algorithms until now, when the reader's intuition for the various approaches
is more developed. The running time for each approach is data dependent.
For that reason, we discuss both a worst-case time that gives an upper
bound and a best-case time that gives a lower bound for each approach.
We use the standard notation of Ω(f(n)) for a lower bound, O(f(n)) for
an upper bound, and Θ(f(n)) for a function that is both a lower and an
upper bound.
Definitions and Assumptions. The number of data points is n, the
number of segments we plan to create is K, and thus the average segment
length is L = n/K. The actual length of segments created by an algorithm
varies, and we will refer to the lengths as L_i.
All algorithms, except top-down, perform considerably worse if we allow
any of the L_i to become very large (say n/4), so we assume that the
algorithms limit the maximum segment length to some multiple of the average
length L. It is trivial to code the algorithms to enforce this, so the time
analysis that follows is exact when the algorithm includes this limit.
Empirical results show, however, that the segments generated (with no limit
on length) are tightly clustered around the average length, so this limit
has little effect in practice.
We assume that for each set S of points, we compute a best segment
and compute the error in Θ(|S|) time. This reflects the way these algorithms
are coded in practice, which is to use a packaged algorithm or function to
do linear regression. We note, however, that we believe one can produce
asymptotically faster algorithms if one custom codes linear regression (or
other best-fit algorithms) to reuse computed values, so that the computation
is done in less than Θ(n) time in subsequent steps. We leave that as a topic
for future work. In what follows, all computations of best segment and error
are assumed to be Θ(n).
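To make the cost model concrete, here is a minimal sketch of a linear-time best-segment routine of the kind assumed above. The function name and interface are illustrative assumptions, not from the text: it fits a least-squares line using one pass of accumulated sums for the closed-form solution and one pass for the residuals, so it runs in Θ(|S|).

```python
def best_segment_error(points):
    """Fit the least-squares line to (x, y) points and return
    (slope, intercept, sum_of_squared_errors).

    One pass accumulates the four sums of the closed-form solution,
    and a second pass sums the residuals, so the cost is Theta(|S|).
    """
    m = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = m * sxx - sx * sx
    # Degenerate case (all x equal): fall back to a horizontal fit.
    slope = (m * sxy - sx * sy) / denom if denom else 0.0
    intercept = (sy - slope * sx) / m
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return slope, intercept, sse
```

Calling such a routine afresh for every candidate segment is exactly the "packaged linear regression" cost model assumed above; reusing the running sums across overlapping calls is what could push subsequent computations below Θ(n).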
Top-Down. The best time for Top-Down occurs if each split occurs at
the midpoint of the data. The first iteration computes, for each split point
i, the best line for points [1, i] and for points [i + 1, n]. This takes Θ(n) for
each split point, or Θ(n²) total for all split points. The next iteration finds
split points for [1, n/2] and for [n/2 + 1, n]. This gives a recurrence
T(n) = 2T(n/2) + Θ(n²) where we have T(2) = c, and this solves to
T(n) = Ω(n²).
This is a lower bound because we assumed the data has the best possible
split points.