positions. The authors demonstrate that order invariance is an important consideration
for domains such as automotive engineering and smart home environments [ 33 , 35 ],
where multiple sensors observe contextual patterns in their naturally occurring order,
and time series are compared according to the occurrence of these multivariate patterns.
Evaluation Criterion. Evaluation criteria for clustering differ depending on whether
the ground truth is known or unknown [ 14 ]. In the case of known ground truth,
the similarity between known clusters and obtained clusters can be measured. The
most commonly used clustering quality measure for known ground truth is the Rand
Index or minor variants of it [ 40 ]. In contrast, without prior knowledge the clusters
are usually evaluated according to their within-cluster similarity and between-cluster
dissimilarity [ 14 ]. Various validity indices have been proposed to determine the
number of clusters and their goodness. For instance, the index I has been found to
be consistent and reliable, irrespective of the underlying clustering technique and
data dimensionality, and furthermore has been shown to outperform the Dunn and
Davies-Bouldin index [ 24 ].
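As an illustration of the known-ground-truth case, the following is a minimal sketch of the plain Rand Index, assuming that both clusterings are given as label sequences of equal length. The function name rand_index and its parameters are chosen here for illustration and do not come from the cited works. The index is the fraction of point pairs on which the two clusterings agree, that is, pairs that are grouped together in both or separated in both.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Return the fraction of point pairs on which two clusterings agree.

    A pair counts as an agreement if both clusterings put the two points
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two points to form a pair")

    agreements = 0
    pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_truth = labels_true[i] == labels_true[j]
        same_in_result = labels_pred[i] == labels_pred[j]
        if same_in_truth == same_in_result:
            agreements += 1
        pairs += 1
    return agreements / pairs
```

The value lies between 0 and 1 and does not depend on how the clusters are named, so relabeling the clusters of either input leaves it unchanged. The pairwise loop is quadratic in the number of points, which is acceptable for a sketch but not for large datasets.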
Realistic Assumptions. The majority of publicly available time series datasets
were preprocessed and cleaned before publishing. For instance, the UCR archive [ 9 ]
contains only time series with equal length, which are mostly snippets of the origi-
nal data that were retrieved manually. The publication of perfectly aligned patterns
of equal length has led to a huge number of time series classification and clustering
algorithms that are not able to deal with real-world data, which contains irrelevant
sections. Hu et al. [ 5 ] suggest automatically building a data dictionary, which contains
only a small subset of the training data and neglects irrelevant sections and redun-
dancies. The evaluations show that using a data dictionary with a set of retrieved
subsequences for each class leads to higher classification accuracy and is several
times faster than the compared strawman algorithms. However, one needs to be careful
about how to retrieve subsequences, for reasons explained in the following.
Subsequence Clustering. Keogh and Lin [ 12 ] state that the clustering of time
series subsequences is meaningless, referring to the finding that the output does not
depend on the input, and the resulting cluster centers are close to random ones. In almost
all cases the subsequences are extracted with a sliding window, which is assumed to
be the source of this quirk in clustering. To produce meaningful results the authors
suggest adopting time series motifs, a concept highly related to clusters. Their
experiments demonstrate
that motif-based clustering is able to preserve the patterns found in the original time
series data [ 12 ].
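For readers unfamiliar with the extraction step discussed above, the following minimal sketch shows what sliding-window subsequence extraction typically looks like. The function name sliding_window_subsequences and the parameters window and step are assumptions made for illustration; they are not taken from [ 12 ]. The sketch only illustrates the extraction step that is criticized, not a recommended clustering pipeline.

```python
def sliding_window_subsequences(series, window, step=1):
    """Return all subsequences of length `window`, taken every `step` points.

    With step=1 neighbouring subsequences overlap in window - 1 points,
    which is the heavily overlapping setting discussed in the text.
    """
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    if step < 1:
        raise ValueError("step must be at least 1")
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, step)]
```

A series of length n yields n - window + 1 subsequences when step is 1, and consecutive subsequences differ only in their first and last points.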
Time Series Motifs. Motifs are previously unknown, frequently occurring
patterns, which are useful for various time series mining tasks such as summarization,
visualization, clustering and classification of time series [ 2 , 16 ]. According
to the definition [ 16 ] a time series motif is a subsequence that comprises all non-
trivial matches within a given range. Since the naive (brute-force) approach to motif
discovery has quadratic complexity, Lin et al. [ 16 ] introduce a new motif discov-
ery algorithm that provides fast exact answers, and faster approximate answers,
achieving a speedup of one to two orders of magnitude. In order to reduce the num-
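The quadratic cost of the naive approach can be seen in the following minimal sketch of brute-force motif discovery. It rests on several assumptions that are not taken from [ 16 ]: the motif is read as the subsequence with the most non-trivial matches within a given range, Euclidean distance is used, and a match is treated as trivial simply when it overlaps the candidate, which is a simplification. The names brute_force_motif, window and radius are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_motif(series, window, radius):
    """Return (start, matches) for the subsequence with the most matches.

    `matches` lists the start positions of all subsequences that lie within
    `radius` of the candidate and do not overlap it (simplified notion of
    a non-trivial match). Every candidate is compared with every other
    subsequence, hence the quadratic number of distance computations.
    """
    count = len(series) - window + 1
    if window < 1 or count < 1:
        raise ValueError("window must be between 1 and len(series)")
    subsequences = [series[i:i + window] for i in range(count)]

    best_start, best_matches = None, []
    for i in range(count):
        matches = [
            j for j in range(count)
            if abs(i - j) >= window  # skip overlapping (trivial) matches
            and euclidean(subsequences[i], subsequences[j]) <= radius
        ]
        if best_start is None or len(matches) > len(best_matches):
            best_start, best_matches = i, matches
    return best_start, best_matches
```

On ties the sketch keeps the earliest candidate. It is meant only to make the cost of the naive search visible and does not reproduce the faster algorithm of Lin et al.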