Database Reference
In-Depth Information
H
tests
within the available hypothesis space, which generalize the relation-
ship between a set of candidate input variables and the target variable. The
following notation is a general description of the data mining classification
modeling task: Given a database
D
, containing a complete set of records
X
is a vector of candidate input variables (attributes
from the examined phenomenon, which might have some influence on the
target concept), and
=(
A|T
), where
A
is a target variable (i.e. the target concept). Find
the best set of hypothesis
T
within the available hypothesis space, which
generalize the relationship between a set of candidate input variables and
the target variable (e.g. the model
H
) using some Data Mining algorithm.
Each record is regarded as a complete set of conjunction between attributes
and a target concept (or variable), such as
M
X
i
=(
A
1
i
=
a
1
j
(1)
,A
2
i
=
a
2
j
(2)
,...,A
ni
=
a
nj
(
n
)
,T
i
=
t
j
)
.
A
=(
A
1
,A
2
...A
n
) is a known set of attributes of the desired phenomenon
and
) is a known discrete or continuous
target variable or concept. The goal for the classification task is to generate
best set of hypotheses to describe the model
T
(also noted in
DM
literature as
Y
M
using an algorithm
G
.To
simplify, this means, generating the following general equation:
ˆ
M
G
:
A ¬
T.
(1)
It is obvious that not all attributes are proved to be statistically signif-
icant in order to be included in the set of hypothesizes.
In most algorithms, the database
D
is divided into two parts — learning
D
val
) sets. The first is supposed to hold enough
information for assembling a statistically significant and stable model based
on the
D
learn
) and validation (
(
algorithm. The second part is supposed to ensure that the
algorithm performs its goals by validation of the built models on unseen
records. Evaluating the prediction accuracy of any model
DM
M
, which is built
by a classification algorithm
,iscommonlyperformed,byestimatingthe
Validation Error Rate of the examined model.
When the database
G
is not fixed but, alternatively, accumulated over
time, the classification task should be altered: in every period
D
K
,anewset
of records
X
K
is accumulated to the database.
d
K
will be the notation for
the set of records
X
K
that was added in the start of period
K
,and
D
K
will
D
k
=
d
k
. Therefore, given
be the notation for the accumulated database
a database
D
K
, containing a complete set of records
X
K
, generate the best
set of hypothesis
H
K
to describe the accumulated model
M
K
,andanew
question is encountered: is
M
K,G
=
M
K
+1
,G
,inevery
K
=1
,...,k
?