1. A node containing only one class value (homogeneity measure of zero)
should not be split.
2. A node containing identical input values on all input attributes cannot be
split.
3. A node whose best possible split gain is below a threshold is not worth
splitting.
4. A node should not be split if the split would produce any child node whose size is below a minimum size threshold, as such a split is likely to produce a model that does not generalize.
5. A split with an acceptable gain on the training dataset but a negative gain on the validation dataset will produce an overfit model.
When training, all of the above criteria should be considered. The first two are mandatory; the rest are heuristics. It is good practice to use a validation dataset to help decide when to stop splitting, as in the sketch below.
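To make the criteria concrete, here is a minimal Python sketch of a combined stopping test. The NodeStats fields, threshold names, and default values are assumptions introduced for this illustration, not part of the algorithm as given.

from dataclasses import dataclass

@dataclass
class NodeStats:
    # Hypothetical per-node summary; field names are assumptions for this sketch.
    num_classes: int               # distinct class values present in the node
    inputs_identical: bool         # True if all rows agree on every input attribute
    best_gain: float               # gain of the best split on the training data
    best_gain_validation: float    # gain of that same split on the validation data
    smallest_child_size: int       # size of the smallest child the split creates

def should_stop(n: NodeStats, min_gain: float = 0.01, min_size: int = 5) -> bool:
    # Criteria 1 and 2 are mandatory; 3 through 5 are pre-pruning heuristics.
    return (
        n.num_classes == 1                     # 1: node is already pure
        or n.inputs_identical                  # 2: no attribute can separate the rows
        or n.best_gain < min_gain              # 3: best possible gain is too small
        or n.smallest_child_size < min_size    # 4: a child would be too small
        or n.best_gain_validation < 0          # 5: the split hurts validation accuracy
    )

# Example: a pure node (one class value) stops immediately.
print(should_stop(NodeStats(1, False, 0.3, 0.2, 10)))  # True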
A decision tree example
Consider the dataset in Table 4.1. A car dealer has collected data on 10 visitors to its showroom. The classification problem is the construction of a decision tree that uses the Gender, MaritalStatus, and SpeedingCitations columns to predict CarBuyer. In this example, a decision tree is built manually by following the previously described algorithm.
The first step is to determine the best attribute on which to split. In calculating gain, we use the classification error as the homogeneity index. We use this rather than the Gini index because it is simpler to compute and will allow you to follow the calculations by hand; a short code sketch after Table 4.1 mirrors the computation.
Table 4.1 Car Buyer Data

CarBuyer  Gender  MaritalStatus  SpeedingCitations
Y         M       S              5
Y         F       S              3
N         F       M              0
N         M       M              0
Y         M       S              3
N         F       M              1
Y         F       M              1
N         M       S              3
N         F       M              4
Y         M       M              3
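As an illustration of the gain calculation, the following Python sketch encodes Table 4.1 and computes the classification-error gain of a split on each categorical attribute. The data encoding and function names are assumptions made for this example.

from collections import Counter

# Table 4.1 encoded as (CarBuyer, Gender, MaritalStatus, SpeedingCitations).
DATA = [
    ("Y", "M", "S", 5), ("Y", "F", "S", 3), ("N", "F", "M", 0),
    ("N", "M", "M", 0), ("Y", "M", "S", 3), ("N", "F", "M", 1),
    ("Y", "F", "M", 1), ("N", "M", "S", 3), ("N", "F", "M", 4),
    ("Y", "M", "M", 3),
]

def classification_error(rows):
    # Homogeneity index: 1 minus the proportion of the majority class.
    counts = Counter(row[0] for row in rows)   # row[0] is CarBuyer
    return 1.0 - max(counts.values()) / len(rows)

def split_gain(rows, attr):
    # Gain = parent error minus the size-weighted error of the children.
    children = {}
    for row in rows:
        children.setdefault(row[attr], []).append(row)
    weighted = sum(len(c) / len(rows) * classification_error(c)
                   for c in children.values())
    return classification_error(rows) - weighted

for name, attr in [("Gender", 1), ("MaritalStatus", 2)]:
    print(name, round(split_gain(DATA, attr), 3))   # Gender 0.1, MaritalStatus 0.2

Run on this data, the split on MaritalStatus yields the larger gain (0.2 versus 0.1 for Gender). SpeedingCitations is numeric, so it would normally be split on a threshold value rather than on each distinct value.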
 