Evaluation of Classification Trees - Data Mining with Decision Trees: Theory and Applications


Table 4.4 Characteristics of Qrecall and Hit-rate.

| Parameter | Hit-rate | Qrecall |
|---|---|---|
| Function increasing/decreasing | Non-monotonic | Monotonically increasing |
| End point | Proportion of positive samples in the set | 1 |
| Sensitivity of the measure's value to positive instances | Very sensitive to positive instances at the top of the list. Less sensitive on going down to the bottom of the list. | Same sensitivity to positive instances in all places in the list. |
| Effect of negative class on the measure | A negative instance affects the measure and causes its value to decrease. | A negative instance does not affect the measure. |
| Range | 0 ≤ Hit-rate ≤ 1 | 0 ≤ Qrecall ≤ 1 |
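The characteristics in Table 4.4 can be checked on a small ranked list. The sketch below assumes the usual definitions, which are not restated in this excerpt: for a quota of size j, Hit-rate is the fraction of positives among the top j instances, and Qrecall is the number of positives among the top j instances divided by the total number of positives. The function names and the sample labels are illustrative only.

```python
from itertools import accumulate


def hit_rate_curve(labels):
    """Hit-rate for each quota size j = 1..n.

    Assumed definition: positives among the top j instances, divided by j.
    """
    return [cum / j for j, cum in enumerate(accumulate(labels), start=1)]


def qrecall_curve(labels):
    """Qrecall for each quota size j = 1..n.

    Assumed definition: positives among the top j instances, divided by
    the total number of positives in the list.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("the list must contain at least one positive instance")
    return [cum / n_pos for cum in accumulate(labels)]


# Hypothetical ranked list, most confident prediction first
# (1 = positive instance, 0 = negative instance).
ranked = [1, 0, 1, 1, 0, 0]
hit = hit_rate_curve(ranked)
qrec = qrecall_curve(ranked)

for j, (h, q) in enumerate(zip(hit, qrec), start=1):
    print(f"quota={j}  hit-rate={h:.3f}  qrecall={q:.3f}")
```

On this list, the negative instance in second place lowers the Hit-rate but leaves Qrecall unchanged, the Hit-rate rises and falls while Qrecall never decreases, and the two curves end at the proportion of positives and at 1 respectively.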

random guess, without any learning) is a straight line (or approximately straight, because the values are discrete) which starts at 0 (for a quota size of zero) and ends at 1.

Suppose now that a model gave an optimum prediction, meaning that all positive instances are located at the head of the list and below them, all the negative instances. In this case, the Qrecall curve climbs linearly until a value of 1 is achieved at point n⁺ (where n⁺ is the number of positive samples). From that point on, any quota larger than n⁺ fully extracts the test set's potential, and the value 1 is kept until the end of the list.
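The two reference curves described above can be written down directly. This is a minimal sketch, assuming the random curve is the straight line j/n over quota sizes j = 0..n and the optimum curve is min(j/n⁺, 1); the function names and the sizes used in the example are hypothetical.

```python
def random_qrecall(n):
    """Expected Qrecall of a random ordering for quota sizes 0..n (assumed: j / n)."""
    return [j / n for j in range(n + 1)]


def optimal_qrecall(n, n_pos):
    """Qrecall of an optimum ordering for quota sizes 0..n.

    All n_pos positives sit at the head of the list, so the curve climbs
    linearly to 1 at quota size n_pos and stays at 1 afterwards.
    """
    return [min(j / n_pos, 1.0) for j in range(n + 1)]


# Hypothetical test set: 10 instances, 4 of them positive.
n, n_pos = 10, 4
rand_curve = random_qrecall(n)
opt_curve = optimal_qrecall(n, n_pos)

for j in range(n + 1):
    print(f"quota={j:2d}  random={rand_curve[j]:.2f}  optimum={opt_curve[j]:.2f}")
```

At every quota size the optimum curve is at or above the random curve, which is why a model that beats random ordering without being perfect tends to lie between the two.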

Note that a “good model”, which outperforms random classification though it is not an optimum one, will fall “on average” between these two curves. It may sometimes drop below the random curve, but in general more of the area delineated between the “good model” curve and the random curve lies above the latter than below it. If the opposite is true, then the model is a “bad model” that does worse than a random guess.

The last observation leads us to consider a measure that evaluates the performance of a model by summing the areas delineated between the Qrecall curve of the examined model and the Qrecall curve of a random model (which is linear). Areas above the linear curve are added and areas below the linear curve are subtracted. The areas themselves are calculated by subtracting the Qrecall of a random classification from the Qrecall of the model's classification at every point, as shown in Figure 4.7. The areas where the model performed better than a random guess increase the measure's value, while the areas where the model performed worse than a random guess decrease it. If this total computed area is divided by the area delineated
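The signed-area sum described above, up to the point where this excerpt breaks off, can be sketched as follows. The sketch assumes Qrecall at quota size j is the number of positives among the top j instances divided by the total number of positives, and that the random curve is j/n. The normalizing division is cut off in the text, so it is deliberately left out; the function name and the sample lists are hypothetical.

```python
from itertools import accumulate


def signed_area_vs_random(labels):
    """Sum over all quota sizes of (model Qrecall - random Qrecall).

    Points where the model is above the random line add to the total;
    points where it is below subtract from it. No normalization is applied.
    """
    n = len(labels)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("the list must contain at least one positive instance")
    area = 0.0
    for j, cum in enumerate(accumulate(labels), start=1):
        model_q = cum / n_pos   # assumed Qrecall of the examined model at quota j
        random_q = j / n        # assumed Qrecall of a random ordering at quota j
        area += model_q - random_q
    return area


# Hypothetical rankings (1 = positive, 0 = negative), most confident first.
good = signed_area_vs_random([1, 1, 0, 1, 0, 0])  # positives mostly near the top
bad = signed_area_vs_random([0, 0, 1, 0, 1, 1])   # positives mostly near the bottom
print(f"good model: {good:+.4f}, bad model: {bad:+.4f}")
```

A positive total indicates a ranking that beats a random guess overall, and a negative total indicates one that does worse, matching the “good model” and “bad model” distinction made earlier.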
