each containing $c_a$ and $c_b$ classes. The Mantaras distance between the two partitions due to a single cut point is given below.
$$\mathrm{Dist}(S_a, S_b) = \frac{I(S_a \mid S_b) + I(S_b \mid S_a)}{I(S_a \cap S_b)}$$

Since $I(S_b \mid S_a) = I(S_b \cap S_a) - I(S_a)$,

$$\mathrm{Dist}(S_a, S_b) = 2 - \frac{I(S_a) + I(S_b)}{I(S_a \cap S_b)}$$

where

$$I(S_a) = -\sum_{i=1}^{c_a} S_i \log_2 S_i$$

$$I(S_b) = -\sum_{j=1}^{c_b} S_j \log_2 S_j$$

$$I(S_a \cap S_b) = -\sum_{i=1}^{c_a}\sum_{j=1}^{c_b} S_{ij} \log_2 S_{ij}$$

$$S_i = \frac{|C_i|}{N}, \qquad |C_i| = \text{total count of class } i, \qquad N = \text{total number of instances}, \qquad S_{ij} = S_i \times S_j$$
It chooses the cut point that minimizes the distance. As a stopping criterion, it uses the minimum description length (MDL) criterion discussed previously to determine whether more cut points should be added.
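For concreteness, the sketch below evaluates candidate cut points with the Mantaras distance in its classical reading: the distance is computed between the class partition and the two-interval partition induced by a candidate cut, with the intersection term estimated from the joint block frequencies. The function names (`mantaras_distance`, `best_cut_point`) are illustrative, not taken from any library.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (base 2) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mantaras_distance(part_a, part_b):
    """Mantaras distance between two partitions of the same N instances.

    part_a, part_b: equal-length sequences assigning each instance to a block
    of the respective partition (e.g. class labels vs. interval ids).
    """
    n = len(part_a)
    p_a = [c / n for c in Counter(part_a).values()]                 # S_i
    p_b = [c / n for c in Counter(part_b).values()]                 # S_j
    p_ab = [c / n for c in Counter(zip(part_a, part_b)).values()]   # joint S_ij
    i_ab = entropy(p_ab)                                            # I(S_a ∩ S_b)
    if i_ab == 0.0:
        return 0.0
    return 2.0 - (entropy(p_a) + entropy(p_b)) / i_ab


def best_cut_point(values, labels):
    """Return the candidate cut (midpoint between consecutive distinct values)
    whose induced binary partition is closest, in Mantaras distance, to the
    class partition."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    xs = [values[k] for k in order]
    ys = [labels[k] for k in order]
    best_cut, best_dist = None, float("inf")
    for k in range(1, len(xs)):
        if xs[k - 1] == xs[k]:
            continue  # no boundary between equal values
        cut = (xs[k - 1] + xs[k]) / 2.0
        split = [0 if x <= cut else 1 for x in xs]
        d = mantaras_distance(ys, split)
        if d < best_dist:
            best_cut, best_dist = cut, d
    return best_cut, best_dist
```

In a full discretizer, the selected cut would then be accepted or rejected by the MDL stopping criterion mentioned above, and the procedure repeated on the resulting intervals.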
PKID [122]
In order to maintain both low bias and low variance in a learning scheme, it is advisable to increase both the interval frequency and the number of intervals as the amount of training data increases. A good way to achieve this is to set the interval frequency and the interval number equally proportional to the amount of training data. This is the main purpose of proportional discretization (PKID).
When discretizing a continuous attribute for which there are N instances, supposing that the desired interval frequency is s and the desired interval number is t, PKID calculates s and t by the following expressions:
$$s \times t = N$$

$$s = t$$

so that $s = t = \sqrt{N}$: the attribute is divided into roughly $\sqrt{N}$ intervals, each containing roughly $\sqrt{N}$ instances.
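A minimal sketch of this scheme, assuming the usual equal-frequency realization in which both s and t are set to about √N (the helper name `pkid_cut_points` is illustrative):

```python
import math


def pkid_cut_points(values):
    """Proportional discretization sketch: choose both the interval frequency s
    and the interval count t to be about sqrt(N), then place cut points so that
    each interval holds roughly s instances (equal-frequency binning)."""
    xs = sorted(values)
    n = len(xs)
    t = max(1, int(math.sqrt(n)))   # desired number of intervals
    s = max(1, n // t)              # desired instances per interval
    cuts = []
    for k in range(1, t):
        i = k * s
        if i >= n:
            break
        # put each boundary halfway between neighbouring sorted values
        cuts.append((xs[i - 1] + xs[i]) / 2.0)
    return cuts


# Example: with N = 100 training values, this aims for about 10 intervals
# of about 10 instances each; both quantities grow as more data arrives,
# which is the bias/variance trade-off described above.
```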