training data. This masking of the problem is so successful that many researchers
appear oblivious to it. Our previous work has shown that there frequently
exist many variants of the rules typically derived in machine learning, all of
which cover exactly the same training data. Indeed, one of our previous systems,
The Knowledge Factory [3, 4], provides support for identifying and selecting
between such rule variants.
This paper examines the implications of selecting between such rules on the
basis of their relative generality. We contend that learning biases based on rel-
ative generality can usefully manipulate the expected performance of classifiers
learned from data. The insight that we provide into this issue may help
knowledge engineers make more appropriate selections between alternative rules
when those alternatives derive equal support from the available training data.
We present specific hypotheses relating to reasonable expectations about
classification error for classification rules. We discuss classification rules of the
form Z → y, which should be interpreted as asserting that all cases that satisfy
conditions Z belong to class y. We are interested in learning rules from data.
We allow that evidence about the likely classification performance of a rule
might come from many sources, including prior knowledge, but, in the machine
learning tradition, are particularly concerned with empirical evidence—evidence
obtained from the performance of the rule on sample (training) data. We consider
the learning context in which a rule Z → y is learned from a training set
D = (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ) and is to be applied to a set of
previously unseen data called a test set D′ = (x₁, y₁), (x₂, y₂), ..., (xₘ, yₘ).
For this enterprise to be successful, D and D′ should be drawn from the same or
from related distributions. For the purposes of the current paper we assume that
D and D′ are drawn independently at random from the same distribution and
acknowledge that violations of this assumption may affect the effects that we
predict.
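To make this setting concrete, the following minimal sketch (ours, not the paper's; all names are hypothetical) represents a rule Z → y as a condition paired with a predicted class, and applies such a rule to previously unseen test cases.

```python
from typing import Callable, Dict, List, Tuple

# An instance x maps attribute names to values; y is a class label.
Instance = Dict[str, str]
Example = Tuple[Instance, str]

class Rule:
    """A classification rule Z -> y: all cases satisfying condition Z
    are asserted to belong to class y."""
    def __init__(self, Z: Callable[[Instance], bool], y: str):
        self.Z = Z  # the condition
        self.y = y  # the predicted class

# A rule (here fixed by hand rather than learned from a training set D):
rule = Rule(Z=lambda x: x["colour"] == "red", y="stop")

# Applying it to previously unseen test data D':
D_prime: List[Example] = [({"colour": "red"}, "stop"),
                          ({"colour": "green"}, "go")]
for x, y in D_prime:
    if rule.Z(x):  # the rule covers x and predicts rule.y
        print(x, "predicted:", rule.y, "actual:", y)
```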
We utilize the following notation.
• Z(I) represents the set of instances in instance set I covered by condition Z.
• E(Z → y, I) represents the number of instances in instance set I that Z → y
misclassifies (the absolute error).
• ε(Z → y, I) represents the proportion of instance set I that Z → y
misclassifies (the error) = E(Z → y, I) / |I|.
• W ⊐ Z denotes that the condition W is a proper generalization of condition Z.
W ⊐ Z if and only if the set of descriptions for which W is true is a proper
superset of the set of descriptions for which Z is true.
• NODE(W → y, Z → y) denotes that there is no other distinguishing evidence
between W → y and Z → y. This means that there is no available evidence, other
than the relative generality of W and Z, indicating the likely direction
(negative, zero, or positive) of ε(W → y, D′) − ε(Z → y, D′).
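The sketch below (again our own; the function names are hypothetical) restates these definitions executably: Z(I) as the covered subset of an instance set, E(Z → y, I) as the absolute error, ε(Z → y, I) as the error, and a W ⊐ Z test over a finite space of descriptions.

```python
from typing import Callable, Dict, List, Tuple

# An instance x maps attribute names to values; y is its class label.
Instance = Dict[str, str]
Example = Tuple[Instance, str]
Condition = Callable[[Instance], bool]

def covers(Z: Condition, I: List[Example]) -> List[Example]:
    """Z(I): the instances in instance set I covered by condition Z."""
    return [(x, y) for x, y in I if Z(x)]

def absolute_error(Z: Condition, y: str, I: List[Example]) -> int:
    """E(Z -> y, I): the number of instances in I that Z -> y misclassifies,
    i.e. covered instances whose true class is not y."""
    return sum(1 for x, true_y in covers(Z, I) if true_y != y)

def error(Z: Condition, y: str, I: List[Example]) -> float:
    """epsilon(Z -> y, I): the proportion of instance set I that Z -> y
    misclassifies, E(Z -> y, I) / |I|."""
    return absolute_error(Z, y, I) / len(I)

def is_proper_generalization(W: Condition, Z: Condition,
                             descriptions: List[Instance]) -> bool:
    """W ⊐ Z over a finite description space: the descriptions for which W
    is true form a proper superset of those for which Z is true."""
    W_true = {i for i, d in enumerate(descriptions) if W(d)}
    Z_true = {i for i, d in enumerate(descriptions) if Z(d)}
    return W_true > Z_true  # proper superset
```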
In particular, we require that the empirical evidence be identical. In the
current research the learning systems have access only to empirical evidence
and we assume that W(D) = Z(D) → NODE(W → y, Z → y). Note that W(D) = Z(D)
does not preclude W and Z from covering different test cases at classification
time and hence having different test set error.
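To see how this can happen, here is a toy illustration of our own, reusing the covers and error helpers from the sketch above: the two conditions coincide on the training data but diverge on an unseen test case.

```python
# W is strictly more general than Z, yet on this training set D they
# cover exactly the same instances, so W(D) = Z(D) and NODE holds.
W = lambda x: x["colour"] == "red"
Z = lambda x: x["colour"] == "red" and x["shape"] == "square"

D = [({"colour": "red", "shape": "square"}, "stop"),
     ({"colour": "green", "shape": "square"}, "go")]
D_prime = [({"colour": "red", "shape": "circle"}, "go")]  # unseen test case

assert covers(W, D) == covers(Z, D)   # identical training coverage and error
print(error(W, "stop", D_prime))      # 1.0: W covers the circle, wrongly
print(error(Z, "stop", D_prime))      # 0.0: Z does not cover it
```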
We utilize the