commonly performed by estimating the validation error rate of the examined
model. This is calculated by the following equation:
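\[
\text{validation error rate} \;=\; \frac{\displaystyle\sum_{x \in X_{val}} \mathbf{1}\bigl[\,T(\text{predicted}) \neq T(\text{actual})\,\bigr]}{\lvert X_{val} \rvert} \tag{4.1}
\]
where the sum counts the records on which the hypothesis H(G) is wrong, that is, T(predicted) \neq T(actual), and |X_val| = N is the number of records in the validation set X_val.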
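As a minimal sketch of Equation (4.1), the validation error rate can be computed by counting mismatches between predicted and actual target values. The function name `validation_error_rate` and the sample labels are illustrative assumptions, not part of the original text.

```python
def validation_error_rate(predicted, actual):
    """Fraction of validation records whose predicted target differs from the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("validation set is empty")
    # Count the records where the hypothesis is wrong: T(predicted) != T(actual).
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)  # divide by N = |X_val|


# Hypothetical validation set of N = 5 records (made-up labels).
predicted = ["pass", "fail", "pass", "pass", "fail"]
actual = ["pass", "pass", "pass", "fail", "fail"]
rate = validation_error_rate(predicted, actual)
print(rate)  # 0.4 (2 wrong out of 5)
```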
When the database D is not fixed but accumulates over time, the data-mining task must be restated. In every period K, a new set of records X_K is added to the database; d_K denotes the set of records X_K that was added at the start of period K, and D_K is the accumulated database, $D_K = \bigcup_{k=1}^{K} d_k$. The task therefore becomes: given a database D_K containing a complete set of records X_K, generate the best set of hypotheses H_K to describe the accumulated model M. At the end of time period K + 1, a new question arises: is M_{K,G} = M_{K+1,G} for every K?
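The accumulation of the database and the question of whether the model stays the same from one period to the next can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the increments are made-up data, list concatenation stands in for the union of the d_k, and `majority_model` is a deliberately trivial stand-in for the algorithm G, not the method discussed in the text.

```python
from collections import Counter


def accumulate(increments, K):
    """Return D_K: the records of d_1 .. d_K concatenated in period order."""
    records = []
    for d_k in increments[:K]:
        records.extend(d_k)
    return records


def majority_model(records):
    """Toy stand-in for algorithm G: the 'model' is the most frequent target value."""
    targets = Counter(target for _, target in records)
    return targets.most_common(1)[0][0]


# Hypothetical increments d_1, d_2, d_3; each record is (attribute, target).
increments = [
    [("m", "pass"), ("f", "pass"), ("f", "fail")],
    [("m", "pass"), ("f", "pass")],
    [("m", "fail"), ("m", "fail"), ("m", "fail"), ("m", "fail")],
]

# Build the model on D_1, D_2, D_3 and compare consecutive periods.
models = [majority_model(accumulate(increments, K)) for K in (1, 2, 3)]
print(models)  # ['pass', 'pass', 'fail']
print(models[0] == models[1], models[1] == models[2])  # True False
```

In this toy run the model is unchanged after the second period but differs after the third, which is the kind of event a change-detection procedure is meant to flag.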
As noted before, several methods dealing with this problem have already been
proposed by researchers. Most methods have dealt with “how the model M can be
updated efficiently when a new period K is encountered” or “how we can adapt to
the time factor,” rather than asking the following questions:
1. Was the model significantly changed during the period K?
2. What was the nature of the change?
3. Should we consider several of the past periods as redundant or not required for an algorithm G to generate a better model M?
Hence, the objective is to define and evaluate a change-detection methodology for identifying a significant change that happened during period K in a data-mining model, which was incrementally built in periods 1 to K - 1, based on the data that was accumulated during period K.
4.2.2 Variety of Changes
Several kinds of significant change might occur when building the model M with the algorithm G. The possible causes of a significant change in a data-mining classification model include:
1. A change in the distribution of one or more of the candidate attributes (A). For example, the database in periods 1 to K - 1 consists of 45% male and 55% female examples, and in period K all records are male samples.
2. A change in the distribution of the target variable (T). For example, consider the case of examining the rate of failures in a seminar examination based on the characteristics of the students in the course in past consecutive years. If the failure rate was 20% in 1999 and 40% in 2000, then a change in the target distribution has occurred.
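Both examples in the list above amount to comparing the category proportions of period K with those of the earlier periods. One simple way to quantify such a shift, assumed here for illustration and not stated in the text, is a Pearson chi-square goodness-of-fit statistic that treats the earlier proportions as the expected distribution. The period size of 100 records and the helper name `chi_square_statistic` are hypothetical, and the sketch assumes every category has a nonzero expected proportion.

```python
def chi_square_statistic(observed_counts, expected_proportions):
    """Pearson chi-square statistic of observed counts against expected proportions."""
    n = sum(observed_counts.values())
    statistic = 0.0
    for category, proportion in expected_proportions.items():
        expected = n * proportion  # expected count if nothing changed
        observed = observed_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Chi-square critical value for 1 degree of freedom at the 5% significance level.
CRITICAL_VALUE_DF1_5PCT = 3.841

# Attribute example: 45% male / 55% female in periods 1..K-1, only males in period K.
# The 100 records in period K are an assumption made for illustration.
attribute_stat = chi_square_statistic(
    {"male": 100, "female": 0}, {"male": 0.45, "female": 0.55}
)

# Target example: failure rate of 20% in 1999 and 40% in 2000 (again assuming 100 students).
target_stat = chi_square_statistic(
    {"fail": 40, "pass": 60}, {"fail": 0.20, "pass": 0.80}
)

print(round(attribute_stat, 2), attribute_stat > CRITICAL_VALUE_DF1_5PCT)  # 122.22 True
print(round(target_stat, 2), target_stat > CRITICAL_VALUE_DF1_5PCT)  # 25.0 True
```

Under these assumed sample sizes both statistics exceed the critical value, so both shifts would be judged significant at the 5% level. With far fewer records per period the same percentage differences might not be.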