commonly performed by estimating the validation error rate of the examined
model. This is calculated by the following equation:
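$$\text{validation error rate} = \frac{\sum_{X_{val}} \big[\, T(\text{predicted}) \neq T(\text{actual}) \,\big]}{N(X_{val})} \qquad (4.1)$$

where the sum counts the records of the validation set X_val on which the
hypothesis H(G) is wrong, and N(X_val) is the number of records in the
validation set.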
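As a minimal sketch of Equation 4.1, assuming the predicted and actual target values of the validation records are available as two equal-length Python lists (the function name `validation_error_rate` and the sample labels are illustrative and do not come from the text):

```python
def validation_error_rate(predicted, actual):
    """Fraction of validation records whose predicted target differs from the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("validation set is empty")
    # Numerator of Eq. 4.1: count records where T(predicted) != T(actual).
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    # Denominator of Eq. 4.1: N, the number of records in the validation set.
    return wrong / len(actual)


# Hypothetical validation set of five records.
predicted = ["pass", "fail", "pass", "pass", "fail"]
actual = ["pass", "pass", "pass", "fail", "fail"]
rate = validation_error_rate(predicted, actual)
print(rate)  # 2 wrong out of 5 -> 0.4
```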
When the database D is not fixed but is accumulated over time, the data-mining
task should be altered. In every period K, a new set of records X_K is added
to the database; d_K is the set of records X_K that was added at the start of
period K, and D_K is the accumulated database, $D_K = \bigcup_{k=1}^{K} d_k$.
Therefore, given a database D_K containing a complete set of records X_K,
generate the best set of hypotheses H_K to describe the accumulated model M.
At the end of time period K+1, a new question is encountered: is
$M_{K,G} = M_{K+1,G}$ for every K?
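The accumulation of D_K from the per-period sets d_1, ..., d_K can be sketched as follows. This is only an illustration of the notation: the function name `accumulate`, the list-of-lists layout, and the sample records are assumptions, and the union is implemented as list concatenation on the assumption that the period sets do not overlap.

```python
def accumulate(periods, K):
    """Return D_K, the records of d_1 .. d_K concatenated in period order."""
    if not 1 <= K <= len(periods):
        raise ValueError("K must be between 1 and the number of periods")
    D_K = []
    for d_k in periods[:K]:
        D_K.extend(d_k)
    return D_K


# Hypothetical records (attribute value, target value) for three periods.
periods = [
    [("m", 1), ("f", 0)],   # d_1
    [("f", 1)],             # d_2
    [("m", 0), ("m", 1)],   # d_3
]
D_2 = accumulate(periods, 2)
D_3 = accumulate(periods, 3)
print(len(D_2), len(D_3))  # 3 5
```

A model induced by G from D_2 and one induced from D_3 are the two objects the question above asks to compare.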
As noted before, several methods dealing with this problem have already been
proposed by researchers. Most methods have dealt with “how the model M can be
updated efficiently when a new period K is encountered” or “how we can adapt
to the time factor,” rather than asking the following questions:
1. Was the model significantly changed during period K?
2. What was the nature of the change?
3. Should we consider several of the past periods as redundant or not required
for an algorithm G to generate a better model M?
Hence, the objective is to define and evaluate a change-detection methodology
for identifying a significant change that happened during period K in a
data-mining model, which was incrementally built in periods 1 to K-1, based on
the data that was accumulated during period K.
4.2.2 Variety of Changes
There are various significant changes that might occur when building the model M
based on the algorithm G. There are several possible causes for significant
changes in a data-mining classification model:
1. A change in the distribution of one or more of the candidate attributes (A).
For example, the database in periods 1 to K-1 consists of 45% male and 55%
female examples, while in period K all records are of male samples.
2. A change in the distribution of the target variable (T). For example, consider
the case of examining the rate of failures in a seminar examination based on the
characteristics of the students in the course in past consecutive years. If the
failure rate was 20% in 1999 and 40% in 2000, then a change in the target
distribution has occurred.
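One illustrative way to check whether a shift such as 20% to 40% is larger than chance would explain is a Pearson chi-square test on a 2x2 table of period against target value. The sketch below is an assumption made for illustration, not necessarily the test this chapter adopts. The cohort size of 100 students per year is hypothetical (the text gives only the rates), and 3.841 is the standard chi-square critical value for one degree of freedom at the 0.05 level.

```python
def chi_square_2x2(pos_old, n_old, pos_new, n_new):
    """Pearson chi-square statistic for a 2x2 table (period group x target value).

    Assumes both target values occur at least once overall, so that no
    expected count is zero.
    """
    observed = [
        [pos_old, n_old - pos_old],
        [pos_new, n_new - pos_new],
    ]
    total = n_old + n_new
    row_totals = [n_old, n_new]
    col_totals = [pos_old + pos_new, total - pos_old - pos_new]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column of the table must be non-empty")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Hypothetical counts: 100 students per year, 20 failures in 1999, 40 in 2000.
stat = chi_square_2x2(20, 100, 40, 100)
CRITICAL_0_05_DF1 = 3.841  # chi-square critical value, 1 degree of freedom
changed = stat > CRITICAL_0_05_DF1
print(round(stat, 2), changed)  # 9.52 True
```

With smaller cohorts the same rates may not reach significance, which is why the counts, not only the percentages, matter when deciding whether a change in the target distribution has occurred. The same statistic applies to a binary candidate attribute, such as the male/female split in the first example.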