“But how can we compute the information gain of an attribute that is continuous-valued, unlike in the example?” Suppose, instead, that we have an attribute A that is continuous-valued, rather than discrete-valued. (For example, suppose that instead of the discretized version of age from the example, we have the raw values for this attribute.) For such a scenario, we must determine the “best” split-point for A, where the split-point is a threshold on A.
We first sort the values of A in increasing order. Typically, the midpoint between each pair of adjacent values is considered as a possible split-point. Therefore, given v values of A, then v − 1 possible splits are evaluated. For example, the midpoint between the values a_i and a_{i+1} of A is

    (a_i + a_{i+1}) / 2.    (8.4)
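The midpoint rule of Eq. (8.4) can be sketched in a few lines of Python (the function name is ours, introduced for illustration):

```python
def candidate_split_points(values):
    """Return the v - 1 midpoints between adjacent sorted values of A."""
    vals = sorted(values)
    return [(vals[i] + vals[i + 1]) / 2 for i in range(len(vals) - 1)]

# e.g., raw (undiscretized) age values
print(candidate_split_points([30, 25, 40]))  # [27.5, 35.0]
```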
If the values of A are sorted in advance, then determining the best split for A requires only one pass through the values. For each possible split-point for A, we evaluate Info_A(D), where the number of partitions is two, that is, v = 2 (or j = 1, 2) in Eq. (8.2). The point with the minimum expected information requirement for A is selected as the split_point for A. D_1 is the set of tuples in D satisfying A ≤ split_point, and D_2 is the set of tuples in D satisfying A > split_point.
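The procedure just described can be sketched as follows. This is a minimal illustration, not the book's implementation: for clarity it recomputes the two partitions at every candidate threshold rather than maintaining running class counts in a true single pass over the sorted values.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the threshold t minimizing Info_A(D) for the binary split
    A <= t versus A > t (i.e., v = 2 in Eq. 8.2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal adjacent values yield no usable midpoint
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x <= t]   # D1: A <= split_point
        right = [y for x, y in pairs if x > t]   # D2: A > split_point
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info
```

For example, with raw ages [25, 30, 35, 40] and class labels ['no', 'no', 'yes', 'yes'], the split-point 32.5 produces two pure partitions, so the expected information requirement there is 0.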
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier such as product ID. A split on product ID would result in a large number of partitions (as many as there are values), each one containing just one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_{product ID}(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as

    SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|).    (8.5)
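Equation (8.5), and the gain-ratio normalization it feeds (C4.5 divides Gain(A) by SplitInfo_A(D)), can be sketched as follows; the function names are ours, introduced for illustration:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) of Eq. (8.5): potential information of the split itself."""
    n = sum(partition_sizes)
    return -sum((d / n) * math.log2(d / n) for d in partition_sizes if d > 0)

def gain_ratio(gain, partition_sizes):
    """C4.5's normalization: Gain(A) divided by SplitInfo_A(D)."""
    si = split_info(partition_sizes)
    return gain / si if si > 0 else 0.0
```

A split like product ID that puts each of 14 tuples in its own partition has SplitInfo = log2(14) ≈ 3.81, so even its maximal gain is heavily discounted, whereas a balanced binary split has SplitInfo = 1.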
This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the