attribute values and classes. The test compares the observed distribution among classes
of tuples covered by a rule with the expected distribution that would result if the
rule made predictions at random. We want to assess whether any observed differences
between these two distributions may be attributed to chance. We can use the likelihood
ratio statistic,

Likelihood\_Ratio = 2 \sum_{i=1}^{m} f_i \log \frac{f_i}{e_i}, \qquad (8.19)
where m is the number of classes.
For tuples satisfying the rule, f_i is the observed frequency of each class i among those
tuples, and e_i is what we would expect the frequency of each class i to be if the rule made
random predictions. The statistic has a χ² distribution with m − 1 degrees of freedom.
The higher the likelihood ratio, the more likely it is that there is a significant difference in the
number of correct predictions made by our rule in comparison with a “random guesser.”
That is, the performance of our rule is not due to chance. The ratio helps identify rules
with insignificant coverage.
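To make the computation concrete, here is a minimal Python sketch (not from the text) that evaluates Equation (8.19) for a hypothetical rule. The class counts, the 60%/40% base distribution, and the use of SciPy's chi-square survival function are illustrative assumptions, not part of the original example.

```python
import math
from scipy.stats import chi2  # chi-square distribution with m - 1 degrees of freedom

def likelihood_ratio(observed, expected):
    """Likelihood ratio statistic of Eq. (8.19).

    observed[i]: f_i, observed frequency of class i among tuples covered by the rule
    expected[i]: e_i, expected frequency of class i if the rule predicted at random
    """
    # Terms with f_i == 0 contribute nothing (lim x->0 of x log x = 0).
    return 2.0 * sum(f * math.log(f / e) for f, e in zip(observed, expected) if f > 0)

# Hypothetical rule covering 50 tuples in a 2-class problem where the overall
# class distribution is 60%/40%, so random prediction expects 30 and 20 tuples:
f = [45, 5]        # observed class counts among covered tuples
e = [30.0, 20.0]   # expected counts under random prediction
stat = likelihood_ratio(f, e)
df = len(f) - 1    # m - 1 degrees of freedom
print(f"Likelihood ratio = {stat:.2f}, p-value = {chi2.sf(stat, df):.4f}")
```

A large statistic (small p-value) indicates that the rule's class distribution differs from what chance alone would produce.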
CN2 uses entropy together with the likelihood ratio test, while FOIL's information
gain is used by RIPPER.
Rule Pruning
Learn_One_Rule does not employ a test set when evaluating rules. Assessments of rule
quality as described previously are made with tuples from the original training data.
These assessments are optimistic because the rules will likely overfit the data. That is,
the rules may perform well on the training data, but less well on subsequent data. To
compensate for this, we can prune the rules. A rule is pruned by removing a conjunct
(attribute test). We choose to prune a rule, R , if the pruned version of R has greater
quality, as assessed on an independent set of tuples. As in decision tree pruning, we refer
to this set as a pruning set . Various pruning strategies can be used such as the pessimistic
pruning approach described in the previous section.
FOIL uses a simple yet effective method. Given a rule, R,

FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}, \qquad (8.20)
where pos and neg are the number of positive and negative tuples covered by R, respec-
tively. This value will increase with the accuracy of R on a pruning set. Therefore, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
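As an illustration, a small sketch of how Equation (8.20) might drive the pruning decision; the pos/neg counts are hypothetical numbers chosen only to show the comparison.

```python
def foil_prune(pos, neg):
    """FOIL_Prune value of Eq. (8.20) for a rule covering `pos` positive
    and `neg` negative tuples on the pruning set."""
    return (pos - neg) / (pos + neg)

# Hypothetical counts measured on an independent pruning set:
full_rule   = foil_prune(pos=40, neg=10)  # rule with all conjuncts
pruned_rule = foil_prune(pos=45, neg=9)   # same rule with one conjunct removed
if pruned_rule > full_rule:
    print("Prune: the shorter rule scores higher on the pruning set.")
```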
By convention, RIPPER starts with the most recently added conjunct when con-
sidering pruning. Conjuncts are pruned one at a time as long as this results in an
improvement.
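The greedy, last-conjunct-first pruning loop could be sketched as follows. The `evaluate` callback is a hypothetical stand-in for scoring a conjunct list with FOIL_Prune on the pruning set; this is an assumption for illustration, not RIPPER's actual implementation.

```python
def greedy_prune(conjuncts, evaluate):
    """Repeatedly drop the most recently added conjunct while the score improves.

    conjuncts : list of attribute tests, in the order they were added to the rule
    evaluate  : function mapping a conjunct list to its FOIL_Prune value
                on the pruning set (supplied by the caller)
    """
    best_score = evaluate(conjuncts)
    while len(conjuncts) > 1:
        candidate = conjuncts[:-1]      # remove the most recently added conjunct
        score = evaluate(candidate)
        if score > best_score:          # keep pruning only while it improves
            conjuncts, best_score = candidate, score
        else:
            break
    return conjuncts
```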
 