We have applied the compression procedure to the data sets in Table 1,
and compared it with two simple techniques: equally spaced points and
randomly selected points. We have experimented with different compression
rates, which are defined as the percentage of points removed from a series.
For example, “eighty-percent compression” means that we select 20% of
points and discard the other 80%.
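For illustration, a minimal Python sketch of the two baseline techniques follows; the function names, the endpoint handling, and the use of linear interpolation to reconstruct the series are our assumptions, not details from the text.

```python
# A minimal sketch, assuming the series is a one-dimensional NumPy array;
# function names, endpoint handling, and linear interpolation are
# illustrative assumptions, not details taken from the text.
import numpy as np

def equally_spaced_indices(series, rate):
    """Indices of equally spaced points; rate is the fraction of points removed."""
    n = len(series)
    keep = max(2, round(n * (1.0 - rate)))        # e.g. rate = 0.8 keeps 20% of points
    return np.unique(np.linspace(0, n - 1, keep).round().astype(int))

def random_indices(series, rate, seed=0):
    """Indices of randomly selected points, always keeping both endpoints."""
    n = len(series)
    keep = max(2, round(n * (1.0 - rate)))
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(n, size=keep, replace=False))
    idx[0], idx[-1] = 0, n - 1                    # cover the whole series for interpolation
    return np.unique(idx)

def interpolate_back(series, idx):
    """Linearly interpolate the kept points back to the original length."""
    return np.interp(np.arange(len(series)), idx, series[idx])
```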
For each compression technique, we have measured the difference between the original series and the compressed series. We have considered three measures of difference between the original series, $a_1, \ldots, a_n$, and the series interpolated from the compressed version, $b_1, \ldots, b_n$.

Mean difference: $\frac{1}{n} \cdot \sum_{i=1}^{n} |a_i - b_i|$.

Maximum difference: $\max_{i \in [1, \ldots, n]} |a_i - b_i|$.

Root mean square difference: $\sqrt{\frac{1}{n} \cdot \sum_{i=1}^{n} (a_i - b_i)^2}$.
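These three measures translate directly into code. The sketch below assumes the original series a and the interpolated series b are NumPy arrays of equal length; the function names are ours.

```python
# Direct translations of the three difference measures, assuming a and b
# are NumPy arrays of equal length (function names are ours).
import numpy as np

def mean_difference(a, b):
    return np.mean(np.abs(a - b))

def max_difference(a, b):
    return np.max(np.abs(a - b))

def rms_difference(a, b):
    return np.sqrt(np.mean((a - b) ** 2))
```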
We summarize the results in Table 2, which shows that compression by important points is significantly more accurate than the two simple methods.
4. Similarity Measures
We define similarity between time series, which underlies the retrieval procedure. We measure similarity on a zero-to-one scale; zero means no likeness and one means perfect likeness. We review three basic measures of similarity and then propose a new measure. First, we define similarity between two numbers, $a$ and $b$:

$$\mathrm{sim}(a, b) = 1 - \frac{|a - b|}{|a| + |b|}.$$
The mean similarity between two series, $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$, is the mean of their point-by-point similarity:

$$\frac{1}{n} \cdot \sum_{i=1}^{n} \mathrm{sim}(a_i, b_i).$$
We also define the root mean square similarity:

$$\sqrt{\frac{1}{n} \cdot \sum_{i=1}^{n} \mathrm{sim}(a_i, b_i)^2}.$$
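For concreteness, a Python sketch of the point similarity, the mean similarity, and the root mean square similarity follows; the function names are ours, and the handling of the case $a = b = 0$ is an assumption, since the text does not cover that corner case.

```python
# Sketch of the similarity measures defined above; the handling of
# a = b = 0 (treated as perfect likeness) is our assumption, since the
# text does not cover that corner case.
import numpy as np

def sim(a, b):
    """Similarity between two numbers on the zero-to-one scale."""
    denom = abs(a) + abs(b)
    if denom == 0.0:
        return 1.0                     # both values are zero: assumed perfect likeness
    return 1.0 - abs(a - b) / denom

def mean_similarity(a_series, b_series):
    """Mean of the point-by-point similarities."""
    return np.mean([sim(a, b) for a, b in zip(a_series, b_series)])

def rms_similarity(a_series, b_series):
    """Root mean square of the point-by-point similarities."""
    return np.sqrt(np.mean([sim(a, b) ** 2 for a, b in zip(a_series, b_series)]))
```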
In addition, we consider the correlation coefficient, which is a standard statistical measure of similarity. It ranges from −1 to 1, but we can convert it to the zero-to-one scale.
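One possible conversion, assumed here for illustration since this excerpt does not give the exact formula, is the linear rescaling $(r + 1)/2$:

```python
# Illustration only: the linear rescaling (r + 1) / 2 is an assumed way
# to map the correlation coefficient from [-1, 1] to [0, 1]; the text
# does not specify the exact conversion.
import numpy as np

def correlation_similarity(a_series, b_series):
    r = np.corrcoef(a_series, b_series)[0, 1]    # Pearson correlation in [-1, 1]
    return (r + 1.0) / 2.0                       # rescaled to the zero-to-one scale
```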