Database Reference
In-Depth Information
lies the main obstacle of such data; they may contain a significant amount
of outliers (Figure 1) or in other words incorrect data measurements.
We argue that the use of non-metric distance functions is prefer-
able, since such functions can effectively 'ignore' the noisy sections of the
sequences.
The rest of the chapter is organized as follows. In section 2 we will review
the most prevalent similarity measures used in time-series databases, we
will demonstrate the need for non-metric distance functions and we will also
consider related work. In section 3 we formalize the new similarity functions
by extending the Longest Common Subsequence (LCSS) model. Section 4
demonstrates ecient algorithms to compute these functions and section 5
elaborates on the indexing structure. Section 6 provides the experimental
validation of the accuracy and eciency of the proposed approach. Finally,
section 7 concludes the chapter.
2. Background
Indexing of time-series has concentrated great attention in the research
community, especially in the last decade, and this can partly be attributed
to the explosion of database sizes. Characteristic examples are environmen-
tal data collected on a daily basis or satellite image databases of the earth 1 .
If we would like to allow the user to explore these vast databases, we should
organize the data in such a way, so that one can retrieve accurately and
eciently the data of interest.
More specifically, our objective is the automatic classification of time-
series using Nearest Neighbor Classification (NNC). In NNC the time-series
query is classified according to the majority of its nearest neighbors. The
NNC is conceptually a simple technique, but provides very good results
in practical situations. This classification technique is particularly suited
for our setting, since we are going to use a non-metric distance function.
Furthermore, the technique has good theoretical properties; it has been
shown that the one nearest neighbor rule has asymptotic error rate that is
at most twice the Bayes error rate [13].
So, the problem we consider is: “Given a database
D
of time-series and
aquery
Q
(not already in the database), find the sequence
T
that is closest
to
.” We need to define the following:
1. A realistic distance function that will match the user's perception of
what is considered similar.
Q
1 http://www.terraserver.com
Search WWH ::




Custom Search