CHANGE DETECTION IN CLASSIFICATION MODELS INDUCED FROM TIME SERIES DATA - Data Mining in Time Series Databases


period. The change can be caused by one of the three causes that were
mentioned above or a simple combination of them. Table 1 presents the
possible combinations of significant causes in a given period.

Defining the variety of possible changes in a data-mining model is a
relatively new concept. As noted, several researchers have tended to deal
with concept change, population change, activity monitoring, etc. The notion
that all three major causes interact and affect each other is new, and it is
tested and validated in this work for the first time.

2.3. Statistical Hypothesis Testing

In order to determine whether or not a significant change has occurred in
period $K$, a set of statistical estimators is presented in this chapter. The
use of these estimators is subject to several conditions:

1. Every period contains a sufficient amount of training data to
rebuild a model for that specific period. The decision of whether a period
contains a sufficient number of records should be based on the classification
algorithm in use, the inherent noisiness of the training data, the
acceptable difference between the training and validation error rates,
and so on.

2. The same DM algorithm is used in all periods to build the classification
model (e.g., C4.5 or IFN).

3. The same method is used for estimating the validation error rate in all
periods (e.g., 5-fold or 10-fold cross validation, 1/3 holdout, etc.).
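Conditions 2 and 3 can be illustrated with a minimal sketch in which one learner and one validation procedure are applied to every period. The names `majority_class_learner`, `cv_error_rate`, and `period_error_rates` are hypothetical, and the majority-class learner is only a stand-in for a real DM algorithm such as C4.5 or IFN, which the chapter does not specify in code.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Record = Tuple[Tuple, Hashable]          # (input values, target class)
Model = Callable[[Tuple], Hashable]      # maps input values to a predicted class
Learner = Callable[[Sequence[Record]], Model]


def majority_class_learner(train: Sequence[Record]) -> Model:
    """Hypothetical stand-in for the DM algorithm (e.g., C4.5 or IFN)."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda inputs: most_common


def cv_error_rate(records: Sequence[Record], learner: Learner, folds: int = 5) -> float:
    """Estimate the validation error rate by k-fold cross validation."""
    if folds < 2 or len(records) < folds:
        raise ValueError("need at least 2 folds and one record per fold")
    errors = 0
    for i in range(folds):
        # Fold i holds every folds-th record, starting at index i.
        test = [r for j, r in enumerate(records) if j % folds == i]
        train = [r for j, r in enumerate(records) if j % folds != i]
        model = learner(train)
        errors += sum(model(x) != y for x, y in test)
    return errors / len(records)


def period_error_rates(periods: List[Sequence[Record]], learner: Learner,
                       folds: int = 5) -> List[float]:
    """Apply the same learner and the same validation method to every period."""
    return [cv_error_rate(p, learner, folds) for p in periods]
```

Passing the learner and the number of folds as arguments keeps both fixed across periods, so differences between the resulting error rates are not caused by a change in the algorithm or in the estimation method.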

Detecting a change in “patterns” (rules). “Patterns” (rules) define the
relationship between input and target variables. Since massive data streams
are usually involved in building an incremental model, we can safely assume
that the true validation error rate of a given incremental model is accurately
estimated by the previous $K-1$ periods. Therefore, a change in the rules ($R$)
is encountered during period $K$ if the validation error of the model $M_{K-1}$
(the model that is based on $D_{K-1}$) on the database $D_{K-1}$ is significantly
different from the validation error rate of the model $M_{K-1}$ over $d_K$ (the
records obtained during period $K$).

Therefore, the parameter of interest for the statistical hypothesis testing
is the true validation error rate and the null hypothesis is as follows:

\[
H_0: \hat{e}_{M_{K-1},K} = \hat{e}_{M_{K-1},K-1},
\]
\[
H_1: \hat{e}_{M_{K-1},K} \neq \hat{e}_{M_{K-1},K-1},
\]
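As an illustration of how these hypotheses might be tested, the sketch below compares the two error rates with a standard two-sided z-test for the difference between two independent proportions. This is an assumption: the excerpt does not state the test statistic, and the function name `error_rate_change_test`, its parameters, and the default significance level of 0.05 are hypothetical.

```python
import math
from statistics import NormalDist


def error_rate_change_test(errors_prev, n_prev, errors_curr, n_curr, alpha=0.05):
    """Two-sided z-test for a difference between two error rates.

    errors_prev / n_prev: misclassified and total validation records of
        model M_{K-1} on D_{K-1}
    errors_curr / n_curr: misclassified and total records of model M_{K-1}
        on d_K
    Returns (z, p_value, reject_h0).
    """
    e_prev = errors_prev / n_prev
    e_curr = errors_curr / n_curr
    # Assumed: unpooled variance of the difference of two independent proportions.
    var = e_prev * (1 - e_prev) / n_prev + e_curr * (1 - e_curr) / n_curr
    if var == 0:
        # Both rates are exactly 0 or 1, so the normal approximation degenerates.
        differ = e_curr != e_prev
        z = math.copysign(math.inf, e_curr - e_prev) if differ else 0.0
        return z, (0.0 if differ else 1.0), differ
    z = (e_curr - e_prev) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha
```

For example, `error_rate_change_test(100, 1000, 150, 1000)` compares an error rate of 0.10 on $D_{K-1}$ with an error rate of 0.15 on $d_K$; rejecting $H_0$ would indicate a change in the rules during period $K$. The normal approximation is reasonable only when both samples are large, which is consistent with the condition that every period contains a sufficient amount of data.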
