MEDIAN STRINGS: A REVIEW - Data Mining in Time Series Databases

Database Reference

In-Depth Information

sequence of 2D points recorded from a person's writing on a special tablet.

Each digit is originally given as a sequence

x 1 ,y 1 ,t 1 )

,...,

x k ,y k ,t k )of

(

points in the

th point during

writing a digit. In order to transform such a sequence of points into a string,

we first resample the given data points such that the distance between any

consecutive pair of points has a constant value ∆. That is,

-plane, where

t k is the time of recording the

is transformed

d =(¯

into sequence

x 1 ,

y 1 )

,...,

(¯

x l ,

y l )where

(¯

x i +1 ,

y i +1 )

−

(¯

x i ,

y i )

=∆for

,...,l−

1. Then, a string

a 1 a 2 ···a l− 1 is generated from sequence

d where

y i +1 ). In our exper-

iments we fixed ∆ = 7 so that the digit strings have between 40 and 100

points.

The costs of the edit operations are defined as follows:

a i is the vector pointing from (¯

x i ,

y i )to(¯

x i +1 ,

(

a → ε

. Notice that the minimum cost of a

substitution is equal to zero (if and only if

(

ε → a

|a|

=∆,

(

a → b

|a − b|

), while the maximum cost is

2∆. The latter case occurs if

are parallel and have opposite direction.

We conducted experiments using 10 samples of digit 1, 2 and 3 each,

and 98 samples of digit 6, see Figure 2 for an illustration (for space

reasons only ten samples of digit 6 are shown there). The results of

four algorithms are shown in Figure 3: the genetic algorithm [17] (GA,

Section 5.2.2), the dynamic algorithm [18] (Section 5.3), the greedy algo-

rithm [4] (Section 5.2.1), and the set median. For comparison purpose the

consensus error

and

E S and the computation time

in seconds on a SUN Ultra60

workstation are also listed in Figure 3. Note that the consensus errors of

digit 6 are substantially larger than those of the other digits because of

the definition of consensus error as the sum, but not the average, of the

distances to all input samples.

The best results are achieved by GA, followed by the dynamic approach.

Except for digit 1, the greedy algorithm reveals some weakness. Looking

at the median for digits 2, 3 and 6 it seems that the iterative process ter-

minates too early, resulting in a string (digit) much shorter than it should

be. The reason lies in the simple termination criterion defined in [4]. It

works well for the (short) words used there, but obviously encounters di-

culties in dealing with longer strings occurring in our study. At first glance,

the dynamic approach needs more computation time than the greedy algo-

rithm. But one has to take into account that the recorded time is the total

time of the dynamic process of adding one sample to the existing set each

time, starting from a set consisting of the first sample. Therefore, totally

9 (97) generalized medians have effectively been computed for digits 1/2/3

(6). Taking the actual computation time for each update step of the whole

Data Mining in Time Series Databases

Search WWH ::

Custom Search

Home