attribute values and classes. The test compares the observed distribution among classes
of tuples covered by a rule with the expected distribution that would result if the
rule made predictions at random. We want to assess whether any observed differences
between these two distributions may be attributed to chance. We can use the likelihood
ratio statistic,

Likelihood\_Ratio = 2 \sum_{i=1}^{m} f_i \log \frac{f_i}{e_i}, \qquad (8.19)
where m is the number of classes.
For tuples satisfying the rule, f_i is the observed frequency of each class i among those
tuples, and e_i is what we would expect the frequency of each class i to be if the rule made
random predictions. The statistic has a χ² distribution with m − 1 degrees of freedom.
The higher the likelihood ratio, the more likely it is that there is a significant difference in the
number of correct predictions made by our rule in comparison with a “random guesser.”
That is, the performance of our rule is not due to chance. The ratio helps identify rules
with insignificant coverage.
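To make the computation concrete, here is a minimal Python sketch (not from the text) that evaluates Equation (8.19) for a hypothetical rule. The class counts, the 60%/40% base distribution, and the use of SciPy's chi-square survival function are illustrative assumptions, not part of the original example.

```python
import math
from scipy.stats import chi2  # chi-square distribution with m - 1 degrees of freedom

def likelihood_ratio(observed, expected):
    """Likelihood ratio statistic of Eq. (8.19).

    observed[i]: f_i, observed frequency of class i among tuples covered by the rule
    expected[i]: e_i, expected frequency of class i if the rule predicted at random
    """
    # Terms with f_i == 0 contribute nothing (lim x->0 of x log x = 0).
    return 2.0 * sum(f * math.log(f / e) for f, e in zip(observed, expected) if f > 0)

# Hypothetical rule covering 50 tuples in a 2-class problem where the overall
# class distribution is 60%/40%, so random prediction expects 30 and 20 tuples:
f = [45, 5]        # observed class counts among covered tuples
e = [30.0, 20.0]   # expected counts under random prediction
stat = likelihood_ratio(f, e)
df = len(f) - 1    # m - 1 degrees of freedom
print(f"Likelihood ratio = {stat:.2f}, p-value = {chi2.sf(stat, df):.4f}")
```

A large statistic (small p-value) indicates that the rule's class distribution differs from what chance alone would produce.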
CN2 uses entropy together with the likelihood ratio test, while FOIL's information
gain is used by RIPPER.
Rule Pruning
Learn_One_Rule does not employ a test set when evaluating rules. Assessments of rule
quality as described previously are made with tuples from the original training data.
These assessments are optimistic because the rules will likely overfit the data. That is,
the rules may perform well on the training data, but less well on subsequent data. To
compensate for this, we can prune the rules. A rule is pruned by removing a conjunct
(attribute test). We choose to prune a rule, R , if the pruned version of R has greater
quality, as assessed on an independent set of tuples. As in decision tree pruning, we refer
to this set as a pruning set . Various pruning strategies can be used such as the pessimistic
pruning approach described in the previous section.
FOIL uses a simple yet effective method. Given a rule, R,

FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}, \qquad (8.20)
where pos and neg are the number of positive and negative tuples covered by R, respec-
tively. This value will increase with the accuracy of R on a pruning set. Therefore, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
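As an illustration, a small sketch of how Equation (8.20) might drive the pruning decision; the pos/neg counts are hypothetical numbers chosen only to show the comparison.

```python
def foil_prune(pos, neg):
    """FOIL_Prune value of Eq. (8.20) for a rule covering `pos` positive
    and `neg` negative tuples on the pruning set."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts measured on an independent pruning set:
full_rule   = foil_prune(pos=40, neg=10)  # rule with all conjuncts
pruned_rule = foil_prune(pos=45, neg=9)   # same rule with one conjunct removed
if pruned_rule > full_rule:
    print("Prune: the shorter rule scores higher on the pruning set.")
```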
By convention, RIPPER starts with the most recently added conjunct when con-
sidering pruning. Conjuncts are pruned one at a time as long as this results in an
improvement.
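The greedy, last-conjunct-first pruning loop could be sketched as follows. The `evaluate` callback is a hypothetical stand-in for scoring a conjunct list with FOIL_Prune on the pruning set; this is an assumption for illustration, not RIPPER's actual implementation.

```python
def greedy_prune(conjuncts, evaluate):
    """Repeatedly drop the most recently added conjunct while the score improves.

    conjuncts : list of attribute tests, in the order they were added to the rule
    evaluate  : function mapping a conjunct list to its FOIL_Prune value
                on the pruning set (supplied by the caller)
    """
    best_score = evaluate(conjuncts)
    while len(conjuncts) > 1:
        candidate = conjuncts[:-1]      # remove the most recently added conjunct
        score = evaluate(candidate)
        if score > best_score:          # keep pruning only while it improves
            conjuncts, best_score = candidate, score
        else:
            break
    return conjuncts
```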
 