notion of other distinguishing evidence to allow for the real-world knowledge
acquisition context in which evidence other than that contained in the data
may be brought to bear upon the rule selection problem.
We present two hypotheses relating to classification rules W → y and Z → y learned from real-world data such that W ⊂ Z and NODE(W → y, Z → y).
1. Pr(|ε(W → y, D′) − ε(true → y, D′)| < |ε(Z → y, D′) − ε(true → y, D′)|) > Pr(|ε(W → y, D′) − ε(true → y, D′)| > |ε(Z → y, D′) − ε(true → y, D′)|). That is, the error of the more general rule, W → y, on unseen data will tend to be closer to the proportion of cases in the domain that do not belong to class y than will the error of the more specific rule, Z → y.
2. Pr(|ε(W → y, D′) − ε(W → y, D)| > |ε(Z → y, D′) − ε(Z → y, D)|) > Pr(|ε(W → y, D′) − ε(W → y, D)| < |ε(Z → y, D′) − ε(Z → y, D)|). That is, the error of the more specific rule, Z → y, on unseen data will tend to be closer to the proportion of negative training cases covered by the two rules¹ than will the error of the more general rule, W → y.
Another way of stating these two hypotheses is that, of two rules with identical empirical and other support,
1. the more general can be expected to exhibit classification error closer to that of a default rule, true → y, that is, of assuming all cases belong to the class, and
2. the more specific can be expected to exhibit classification error closer to that observed on the training data.
It is important to clarify at the outset that we are not claiming that the more general rule will invariably have generalization error closer to that of the default rule, nor that the more specific rule will invariably have generalization error closer to the error observed on the training data. Rather, we are claiming that relative generality is a source of evidence that, in the absence of alternative evidence, provides reasonable grounds for believing that each of these effects is more likely than the contrary.
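To make these tendencies concrete, here is a small Monte Carlo sketch. It is our illustration, not the authors' experiment: the synthetic distribution, the rules W → y (antecedent a = 1) and Z → y (antecedent a = 1 ∧ b = 1), and all sample sizes are assumptions, chosen so that identical empirical support can be enforced by rejection sampling and so that the region covered only by the more general rule behaves like the domain at large, one simple reading of the no-other-distinguishing-evidence condition.

```python
import random

random.seed(1)

# Synthetic domain (an assumption for illustration, not the paper's setup).
# Attributes a, b are independent Bernoulli; the class y depends on (a, b).
# Rule W -> y covers a == 1 (more general);
# rule Z -> y covers a == 1 and b == 1 (more specific).
def draw(n):
    cases = []
    for _ in range(n):
        a = random.random() < 0.4
        b = random.random() < 0.5
        y = random.random() < (0.9 if (a and b) else 0.4)
        cases.append((a, b, y))
    return cases

def covers_W(c): return c[0]            # antecedent: a == 1
def covers_Z(c): return c[0] and c[1]   # antecedent: a == 1 and b == 1

def err(covers, data):
    covered = [c for c in data if covers(c)]
    if not covered:
        return None                     # error rate undefined: nothing covered
    return sum(1 for c in covered if not c[2]) / len(covered)

h1 = h2 = trials = 0
while trials < 2000:
    D = draw(8)                         # small training sample
    # Identical empirical support: keep only samples on which W and Z
    # cover exactly the same training cases (cf. the footnote).
    if any(covers_W(c) != covers_Z(c) for c in D):
        continue
    e_train = err(covers_Z, D)          # identical for both rules here
    if e_train is None:
        continue
    Dp = draw(400)                      # unseen data D'
    eW, eZ = err(covers_W, Dp), err(covers_Z, Dp)
    e_default = sum(1 for c in Dp if not c[2]) / len(Dp)  # error of true -> y
    if eW is None or eZ is None:
        continue
    trials += 1
    h1 += abs(eW - e_default) < abs(eZ - e_default)  # hypothesis (1)
    h2 += abs(eZ - e_train) < abs(eW - e_train)      # hypothesis (2)

print(f"hypothesis (1) direction observed in {h1 / trials:.0%} of trials")
print(f"hypothesis (2) direction observed in {h2 / trials:.0%} of trials")
```

Under this setup both frequencies should come out well above one half, which is all the hypotheses assert: each effect is more likely than the contrary, not invariable.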
Observation. With simple assumptions, hypotheses (1) and (2) can be shown to be trivially true given that D and D′ are i.i.d. samples from a single finite distribution 𝒟.
Proof.
1. For any rule X → y and test set D′, ε(X → y, D′) = ε(X → y, X(D′)), as X → y only covers the instances X(D′) of D′.
2. ε(Z → y, D′) = (E(Z → y, Z(D′ ∩ D)) + E(Z → y, Z(D′ − D))) / |Z(D′)|.
3. ε(W → y, D′) = (E(W → y, W(D′ ∩ D)) + E(W → y, W(D′ − D))) / |W(D′)|.
4. Z(D′) ⊆ W(D′), because Z is a specialization of W.
¹ Recall that both rules have identical empirical support and hence cover the same training cases.
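Steps 1–3 simply split the covered portion of the test set across cases that were and were not seen in training, so the decomposition is easy to check numerically. The sketch below is our illustration; it assumes E(X → y, S) counts the misclassifications the rule makes on instance set S, which is what the surrounding formulas suggest.

```python
import random

random.seed(2)

# Numeric check of the decomposition in proof steps 2 and 3, assuming
# E(X -> y, S) is the number of errors X -> y makes on instance set S.
# Instances carry an id so that D' ∩ D and D' − D are well defined.
def make_case(i):
    a = random.random() < 0.4
    b = random.random() < 0.5
    y = random.random() < (0.9 if (a and b) else 0.4)
    return (i, a, b, y)

pool = [make_case(i) for i in range(1000)]  # a finite domain to sample from
D    = random.sample(pool, 200)             # training sample D
Dp   = random.sample(pool, 300)             # test sample D'

covers = lambda c: c[1]                     # rule X -> y with antecedent a == 1

def E(cases):                               # errors among covered instances
    return sum(1 for c in cases if covers(c) and not c[3])

D_ids     = {c[0] for c in D}
overlap   = [c for c in Dp if c[0] in D_ids]      # D' ∩ D
fresh     = [c for c in Dp if c[0] not in D_ids]  # D' − D
n_covered = sum(1 for c in Dp if covers(c))       # |X(D')|

direct = E(Dp) / n_covered                    # step 1: errors over X(D')
split  = (E(overlap) + E(fresh)) / n_covered  # steps 2-3: partition of D'
assert direct == split                        # partitioning changes nothing
print(direct)
```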