Database Reference
In-Depth Information
H
tests
within the available hypothesis space, which generalize the relation-
ship between a set of candidate input variables and the target variable. The
following notation is a general description of the data mining classification
modeling task: Given a database
D
, containing a complete set of records
X
is a vector of candidate input variables (attributes
from the examined phenomenon, which might have some influence on the
target concept), and
=(
A|T
), where
A
is a target variable (i.e. the target concept). Find
the best set of hypothesis
T
within the available hypothesis space, which
generalize the relationship between a set of candidate input variables and
the target variable (e.g. the model
H
) using some Data Mining algorithm.
Each record is regarded as a complete set of conjunction between attributes
and a target concept (or variable), such as
M
X i =(
A 1 i =
a 1 j (1)
,A 2 i =
a 2 j (2)
,...,A ni =
a nj (
n
)
,T i =
t j )
.
A
=(
A 1 ,A 2 ...A n ) is a known set of attributes of the desired phenomenon
and
) is a known discrete or continuous
target variable or concept. The goal for the classification task is to generate
best set of hypotheses to describe the model
T
(also noted in
DM
literature as
Y
M
using an algorithm
G
.To
simplify, this means, generating the following general equation:
ˆ
M G
:
A ¬
T.
(1)
It is obvious that not all attributes are proved to be statistically signif-
icant in order to be included in the set of hypothesizes.
In most algorithms, the database
D
is divided into two parts — learning
D val ) sets. The first is supposed to hold enough
information for assembling a statistically significant and stable model based
on the
D learn ) and validation (
(
algorithm. The second part is supposed to ensure that the
algorithm performs its goals by validation of the built models on unseen
records. Evaluating the prediction accuracy of any model
DM
M
, which is built
by a classification algorithm
,iscommonlyperformed,byestimatingthe
Validation Error Rate of the examined model.
When the database
G
is not fixed but, alternatively, accumulated over
time, the classification task should be altered: in every period
D
K
,anewset
of records
X K
is accumulated to the database.
d K
will be the notation for
the set of records
X K that was added in the start of period
K
,and
D K will
D k = d k . Therefore, given
be the notation for the accumulated database
a database
D K , containing a complete set of records
X K , generate the best
set of hypothesis
H K
to describe the accumulated model
M K ,andanew
question is encountered: is
M K,G =
M K +1 ,G ,inevery
K
=1
,...,k
?
Search WWH ::




Custom Search