$n_{K-1}$ is the number of records in periods $1,\dots,K-1$.
$x_{iK}$ is the number of records in the $i$-th class of variable $X$ in the $K$-th period.
$x_{iK-1}$ is the number of records in the $i$-th class of variable $X$ in periods $1,\dots,K-1$.
If $X_p^2 > \chi^2_{1-\alpha}(j-1)$, where $j$ is the number of classes of the tested variable, then the null hypothesis that the variable $X$'s distribution has been stationary in period $K$ like in the previous periods is rejected.
The explanation of Pearson's statistical hypothesis testing is provided in [28].
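As a rough illustration of the test described above, the following Python sketch computes a Pearson statistic from the class counts and converts it to 1 - p-value. The exact form of the statistic is not restated in this passage, so the sketch assumes the standard Pearson form, with the expected counts in period K derived from the class proportions observed in periods 1,...,K-1. The function names are hypothetical, the chi-square tail probability uses the textbook closed-form series so that only the standard library is needed, and every class is assumed to occur at least once in the earlier periods.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: 2*(1 - Phi(chi)) + 2*phi(chi) * sum_r chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    normal_tail = math.erfc(chi / math.sqrt(2))        # equals 2 * (1 - Phi(chi))
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return normal_tail + 2 * normal_pdf * total


def pearson_change_statistic(prev_counts, curr_counts):
    """Assumed Pearson form: sum over classes of (observed - expected)^2 / expected.

    prev_counts[i] plays the role of x_{iK-1}, curr_counts[i] that of x_{iK}.
    """
    n_prev = sum(prev_counts)   # n_{K-1}
    n_curr = sum(curr_counts)   # number of records in period K
    stat = 0.0
    for x_prev, x_curr in zip(prev_counts, curr_counts):
        expected = n_curr * x_prev / n_prev
        stat += (x_curr - expected) ** 2 / expected
    return stat


def xp_estimator(prev_counts, curr_counts, alpha):
    """Return (1 - p-value, reject flag) for the stationarity hypothesis."""
    j = len(prev_counts)                       # number of classes
    stat = pearson_change_statistic(prev_counts, curr_counts)
    xp = 1.0 - chi2_sf(stat, j - 1)
    # xp > 1 - alpha is equivalent to stat exceeding the (1 - alpha) quantile.
    return xp, xp > 1.0 - alpha
```

Calling `xp_estimator([500, 300, 200], [20, 30, 50], 0.05)` compares a three-class variable whose proportions shift sharply in the last period; the returned flag indicates whether the stationarity hypothesis is rejected at the chosen significance level.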
2.4. Methodology
This section describes the algorithmic usage of the previous estimators:
Inputs:
• G is the DM algorithm used for constructing the classification model (e.g., C4.5 or IFN).
• M is the classification model constructed by the DM algorithm (e.g., a decision tree).
• V is the validation method in use (e.g., 5-fold cross-validation).
• K is the cumulative number of periods in a data stream.
• α is the desired significance level for the change detection procedure (the probability of a false alarm when no actual change is present).
Outputs:
• CD(α) is the error-based change detection estimator (1 - p-value).
• XP(α) is Pearson's chi-square estimator of distribution change (1 - p-value).
2.5. Change Detection Procedure
Stage 1:
For periods $1,\dots,K-1$, build the model $M_{K-1}$ using the DM algorithm $G$.
Define the data set $D_{K-1(val)}$.
Count the number of records $n_{K-1} = |D_{K-1(val)}|$.
Calculate the validation error rate $\hat{e}_{M_{K-1},K-1}$ according to the validation method $V$.
Calculate $x_{iK-1}$, $n_{K-1}$ for every input and target variable existing in periods $1,\dots,K-1$.
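The bookkeeping in Stage 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: records are plain dictionaries, the validation method V is taken to be k-fold cross-validation (so every record is validated once and the validation set has the same size as the data), and `majority_class_learner` is a hypothetical stand-in for the DM algorithm G, not C4.5 or IFN.

```python
from collections import Counter


def majority_class_learner(rows, target):
    """Hypothetical stand-in for G: always predicts the most frequent target value."""
    most_common = Counter(r[target] for r in rows).most_common(1)[0][0]
    return lambda row: most_common


def cross_validation_error(rows, target, learner, folds=5):
    """Validation error rate under k-fold cross-validation (assumed choice of V)."""
    errors = 0
    for f in range(folds):
        test = rows[f::folds]                                   # every folds-th record
        train = [r for i, r in enumerate(rows) if i % folds != f]
        model = learner(train, target)
        errors += sum(model(r) != r[target] for r in test)
    return errors / len(rows)


def stage_one(rows, target, learner, folds=5):
    """Build the model on periods 1..K-1 and collect the Stage 1 quantities."""
    model = learner(rows, target)                               # M_{K-1}
    n_prev = len(rows)                                          # n_{K-1}
    error_rate = cross_validation_error(rows, target, learner, folds)
    # Class counts x_{iK-1} for every input and target variable.
    class_counts = {var: Counter(r[var] for r in rows) for var in rows[0]}
    return model, n_prev, error_rate, class_counts
```

Any learner with the same call signature (rows and target name in, prediction function out) could replace the stand-in; the returned class counts and record count are the quantities the later distribution test consumes.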