training data. This masking of the problem is so successful that many researchers
appear oblivious to it. Our previous work has shown that there frequently
exist many variants of the rules typically derived in machine learning, all of
which cover exactly the same training data. Indeed, one of our previous systems,
The Knowledge Factory [3, 4], provides support for identifying and selecting
between such rule variants.
This paper examines the implications of selecting between such rules on the
basis of their relative generality. We contend that learning biases based on rel-
ative generality can usefully manipulate the expected performance of classifiers
learned from data. The insight that we provide into this issue may help
knowledge engineers make more appropriate selections between alternative rules
when those alternatives derive equal support from the available training data.
We present specific hypotheses relating to reasonable expectations about
classification error for classification rules. We discuss classification rules of the
form Z → y, which should be interpreted as asserting that all cases that satisfy
conditions Z belong to class y. We are interested in learning rules from data.
We allow that evidence about the likely classification performance of a rule
might come from many sources, including prior knowledge, but, in the machine
learning tradition, are particularly concerned with empirical evidence—evidence
obtained from the performance of the rule on sample (training) data. We consider
the learning context in which a rule Z → y is learned from a training set
D = (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) and is to be applied to a set of
previously unseen data called a test set D′ = (x₁, y₁), (x₂, y₂), ..., (xₘ, yₘ).
For this enterprise to be successful, D and D′ should be drawn from the same or
from related distributions. For the purposes of the current paper we assume that
D and D′ are drawn independently at random from the same distribution and
acknowledge that violations of this assumption may affect the effects that we
predict.
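To make this setting concrete, the following minimal sketch (ours, not the paper's; all names are hypothetical) represents a rule Z → y as a condition paired with a predicted class, and applies such a rule to previously unseen test cases.

```python
from typing import Callable, Dict, List, Tuple

# An instance x maps attribute names to values; y is a class label.
Instance = Dict[str, str]
Example = Tuple[Instance, str]

class Rule:
    """A classification rule Z -> y: all cases satisfying condition Z
    are asserted to belong to class y."""
    def __init__(self, Z: Callable[[Instance], bool], y: str):
        self.Z = Z  # the condition
        self.y = y  # the predicted class

# A rule (here fixed by hand rather than learned from a training set D):
rule = Rule(Z=lambda x: x["colour"] == "red", y="stop")

# Applying it to previously unseen test data D':
D_prime: List[Example] = [({"colour": "red"}, "stop"),
                          ({"colour": "green"}, "go")]
for x, y in D_prime:
    if rule.Z(x):  # the rule covers x and predicts rule.y
        print(x, "predicted:", rule.y, "actual:", y)
```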
We utilize the following notation.
• Z(I) represents the set of instances in instance set I covered by condition Z.
• E(Z → y, I) represents the number of instances in instance set I that Z → y
misclassifies (the absolute error).
• ε(Z → y, I) represents the proportion of instance set I that Z → y
misclassifies (the error) = E(Z → y, I) / |I|.
• W ⊐ Z denotes that the condition W is a proper generalization of condition Z.
W ⊐ Z if and only if the set of descriptions for which W is true is a proper
superset of the set of descriptions for which Z is true.
• NODE(W → y, Z → y) denotes that there is no other distinguishing evidence
between W → y and Z → y. This means that there is no available evidence, other
than the relative generality of W and Z, indicating the likely direction
(negative, zero, or positive) of ε(W → y, D′) − ε(Z → y, D′).
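The sketch below (again our own; the function names are hypothetical) restates these definitions executably: Z(I) as the covered subset of an instance set, E(Z → y, I) as the absolute error, ε(Z → y, I) as the error, and a W ⊐ Z test over a finite space of descriptions.

```python
from typing import Callable, Dict, List, Tuple

# An instance x maps attribute names to values; y is its class label.
Instance = Dict[str, str]
Example = Tuple[Instance, str]
Condition = Callable[[Instance], bool]

def covers(Z: Condition, I: List[Example]) -> List[Example]:
    """Z(I): the instances in instance set I covered by condition Z."""
    return [(x, y) for x, y in I if Z(x)]

def absolute_error(Z: Condition, y: str, I: List[Example]) -> int:
    """E(Z -> y, I): the number of instances in I that Z -> y misclassifies,
    i.e. covered instances whose true class is not y."""
    return sum(1 for x, true_y in covers(Z, I) if true_y != y)

def error(Z: Condition, y: str, I: List[Example]) -> float:
    """epsilon(Z -> y, I): the proportion of instance set I that Z -> y
    misclassifies, E(Z -> y, I) / |I|."""
    return absolute_error(Z, y, I) / len(I)

def is_proper_generalization(W: Condition, Z: Condition,
                             descriptions: List[Instance]) -> bool:
    """W ⊐ Z over a finite description space: the descriptions for which W
    is true form a proper superset of those for which Z is true."""
    W_true = {i for i, d in enumerate(descriptions) if W(d)}
    Z_true = {i for i, d in enumerate(descriptions) if Z(d)}
    return W_true > Z_true  # proper superset
```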
In particular, we require that the empirical evidence be identical. In the
current research the learning systems have access only to empirical evidence
and we assume that W(D) = Z(D) → NODE(W → y, Z → y). Note that W(D) = Z(D)
does not preclude W and Z from covering different test cases at classification
time and hence having different test set error.
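To see how this can happen, here is a toy illustration of our own, reusing the covers and error helpers from the sketch above: the two conditions coincide on the training data but diverge on an unseen test case.

```python
# W is strictly more general than Z, yet on this training set D they
# cover exactly the same instances, so W(D) = Z(D) and NODE holds.
W = lambda x: x["colour"] == "red"
Z = lambda x: x["colour"] == "red" and x["shape"] == "square"

D = [({"colour": "red", "shape": "square"}, "stop"),
     ({"colour": "green", "shape": "square"}, "go")]
D_prime = [({"colour": "red", "shape": "circle"}, "go")]  # unseen test case

assert covers(W, D) == covers(Z, D)   # identical training coverage and error
print(error(W, "stop", D_prime))      # 1.0: W covers the circle, wrongly
print(error(Z, "stop", D_prime))      # 0.0: Z does not cover it
```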
We utilize the