period. The change can stem from one of the three causes mentioned
above or from a combination of them. Table 1 presents the
possible combinations of significant causes in a given period.
Defining the variety of possible changes in a data-mining model
is a fairly new concept. As noted, several researchers have tended to deal with
concept change, population change, activity monitoring, etc. The notion
that all three major causes interact and affect each other is new, and
it is tested and validated in this work for the first time.
2.3. Statistical Hypothesis Testing
In order to determine whether or not a significant change has occurred in
period K, a set of statistical estimators is presented in this chapter. The
use of these estimators is subject to several conditions:
1. Every period contains a sufficient amount of training data to
rebuild a model for that specific period. The decision of whether a period
contains a sufficient number of records should be based on the classification
algorithm in use, the inherent noisiness of the training data, the
acceptable difference between the training and validation error rates,
and so on.
2. The same DM algorithm is used in all periods to build the classification
model (e.g., C4.5 or IFN).
3. The same method is used for estimating the validation error rate in all
periods (e.g., 5-fold or 10-fold cross validation, 1/3 holdout, etc.).
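As an illustration of conditions 2 and 3, the following minimal Python sketch applies one fixed learner and one fixed k-fold procedure to the records of any period. The names train_majority and kfold_error_rate are hypothetical, and the majority-class learner is only a stand-in for a real algorithm such as C4.5 or IFN; none of this code comes from the original work.

```python
import random
from collections import Counter


def train_majority(records):
    """Hypothetical stand-in learner: always predicts the most common label."""
    label = Counter(y for _, y in records).most_common(1)[0][0]
    return lambda x: label


def kfold_error_rate(records, train, k=5, seed=0):
    """Estimate the validation error rate with k-fold cross-validation.

    records: list of (input, target) pairs from one period
    train:   function that builds a model (a callable) from records
    """
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the procedure repeatable
    folds = [shuffled[i::k] for i in range(k)]
    errors = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(training)
        errors += sum(1 for x, y in held_out if model(x) != y)
    return errors / len(shuffled)
```

Calling kfold_error_rate with the same train function and the same k for every period keeps the error estimates comparable across periods, which is the point of the two conditions.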
Detecting a change in “patterns” (rules). “Patterns” (rules) define the
relationship between input and target variables. Since massive data streams
are usually involved in building an incremental model, we can safely assume
that the true validation error rate of a given incremental model is accurately
estimated by the previous $K-1$ periods. Therefore, a change in the rules ($R$)
is encountered during period $K$ if the validation error rate of the model
$M_{K-1}$ (the model that is based on $D_{K-1}$) on the database $D_{K-1}$
is significantly different from the validation error rate of the model
$M_{K-1}$ over $d_K$ (the records obtained during period $K$).
Therefore, the parameter of interest for the statistical hypothesis testing
is the true validation error rate and the null hypothesis is as follows:
$H_0: e_{M_{K-1},K} = e_{M_{K-1},K-1}$,
$H_1: e_{M_{K-1},K} \neq e_{M_{K-1},K-1}$,
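One common way to test hypotheses of this form is a two-sided, pooled two-proportion z-test on the observed error counts. The sketch below assumes that test; this section does not confirm which statistic the authors use, and the function name, its arguments, and the default significance level of 0.05 are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def error_rate_change_test(errors_prev, n_prev, errors_new, n_new, alpha=0.05):
    """Two-sided pooled z-test for a difference between two error rates.

    errors_prev, n_prev: misclassified and total validation records for
                         model M_{K-1} on D_{K-1}
    errors_new, n_new:   misclassified and total records for M_{K-1} on d_K
    Returns (z, p_value, reject_h0).
    """
    e_prev = errors_prev / n_prev
    e_new = errors_new / n_new
    # Under H0 both samples share one error rate, so pool the counts.
    pooled = (errors_prev + errors_new) / (n_prev + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_prev + 1 / n_new))
    if se == 0:
        # Both samples are all-correct or all-wrong: no evidence of a difference.
        return 0.0, 1.0, False
    z = (e_new - e_prev) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha
```

For example, error_rate_change_test(100, 1000, 200, 1000) compares an error rate of 0.10 on the earlier data with 0.20 on the new period. The normal approximation behind this test is reasonable only when both samples are large, which matches the massive-data-stream setting described above.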