2.4. Feature Comparison of the Major Algorithms
We have deliberately deferred the discussion of the running times of the
algorithms until now, when the reader's intuition for the various approaches
is more developed. The running time for each approach is data dependent.
For that reason, we discuss both a worst-case time that gives an upper
bound and a best-case time that gives a lower bound for each approach.
We use the standard notation of Ω(f(n)) for a lower bound, O(f(n)) for
an upper bound, and Θ(f(n)) for a function that is both a lower and an
upper bound.
Definitions and Assumptions. The number of data points is n, the
number of segments we plan to create is K, and thus the average segment
length is L = n/K. The actual length of segments created by an algorithm
varies, and we will refer to the lengths as L_i.
All algorithms, except top-down, perform considerably worse if we allow
any of the L_i to become very large (say n/4), so we assume that the
algorithms limit the maximum segment length to some multiple of the average
length L. It is trivial to code the algorithms to enforce this, so the time
analysis that follows is exact when the algorithm includes this limit.
Empirical results show, however, that the segments generated (with no limit
on length) are tightly clustered around the average length, so this limit
has little effect in practice.
We assume that for each set S of points, we compute a best segment
and compute the error in Θ(|S|) time. This reflects the way these algorithms
are coded in practice, which is to use a packaged algorithm or function to
do linear regression. We note, however, that we believe one can produce
asymptotically faster algorithms if one custom codes linear regression (or
other best-fit algorithms) to reuse computed values, so that the computation
is done in less than Θ(n) time in subsequent steps. We leave that as a topic
for future work. In what follows, all computations of best segment and error
are assumed to be Θ(n).
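To make the cost model concrete, here is a minimal sketch of a linear-time best-segment routine of the kind assumed above. The function name and interface are illustrative assumptions, not from the text: it fits a least-squares line using one pass of accumulated sums for the closed-form solution and one pass for the residuals, so it runs in Θ(|S|).

```python
def best_segment_error(points):
    """Fit the least-squares line to (x, y) points and return
    (slope, intercept, sum_of_squared_errors).

    One pass accumulates the four sums of the closed-form solution,
    and a second pass sums the residuals, so the cost is Theta(|S|).
    """
    m = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = m * sxx - sx * sx
    # Degenerate case (all x equal): fall back to a horizontal fit.
    slope = (m * sxy - sx * sy) / denom if denom else 0.0
    intercept = (sy - slope * sx) / m
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return slope, intercept, sse
```

Calling such a routine afresh for every candidate segment is exactly the "packaged linear regression" cost model assumed above; reusing the running sums across overlapping calls is what could push subsequent computations below Θ(n).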
Top-Down. The best time for Top-Down occurs if each split occurs at
the midpoint of the data. The first iteration computes, for each split point
i, the best line for points [1, i] and for points [i + 1, n]. This takes Θ(n) for
each split point, or Θ(n²) total for all split points. The next iteration finds
split points for [1, n/2] and for [n/2 + 1, n]. This gives a recurrence
T(n) = 2T(n/2) + Θ(n²) where we have T(2) = c, and this solves to
T(n) = Ω(n²).
This is a lower bound because we assumed the data has the best possible
split points.