1. A node containing only one class value (homogeneity measure of zero)
should not be split.
2. A node containing identical input values on all input attributes cannot be
split.
3. A node whose best possible split gain is below a threshold is not worth
splitting.
4. A node should not be split if the split would produce any child node whose size is below a minimum size threshold, as such a split is likely to produce a model that does not generalize.
5. A split with an acceptable gain on the training dataset but a negative gain on the validation dataset will produce an overfit model.
When training, all of the above criteria should be considered. The first two are mandatory; the rest are heuristics. It is good practice to use a validation dataset to help decide when to stop splitting, as in the sketch below.
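To make the criteria concrete, here is a minimal Python sketch of a combined stopping test. The NodeStats fields, threshold names, and default values are assumptions introduced for this illustration, not part of the algorithm as given.

from dataclasses import dataclass

@dataclass
class NodeStats:
    # Hypothetical per-node summary; field names are assumptions for this sketch.
    num_classes: int               # distinct class values present in the node
    inputs_identical: bool         # True if all rows agree on every input attribute
    best_gain: float               # gain of the best split on the training data
    best_gain_validation: float    # gain of that same split on the validation data
    smallest_child_size: int       # size of the smallest child the split creates

def should_stop(n: NodeStats, min_gain: float = 0.01, min_size: int = 5) -> bool:
    # Criteria 1 and 2 are mandatory; 3 through 5 are pre-pruning heuristics.
    return (
        n.num_classes == 1                     # 1: node is already pure
        or n.inputs_identical                  # 2: no attribute can separate the rows
        or n.best_gain < min_gain              # 3: best possible gain is too small
        or n.smallest_child_size < min_size    # 4: a child would be too small
        or n.best_gain_validation < 0          # 5: the split hurts validation accuracy
    )

# Example: a pure node (one class value) stops immediately.
print(should_stop(NodeStats(1, False, 0.3, 0.2, 10)))  # True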
A decision tree example
Consider the dataset in Table 4.1. A car dealer has collected data on 10 visitors to its showroom. The classification problem is the construction of a decision tree that uses the Gender, MaritalStatus, and SpeedingCitations columns to predict CarBuyer. In this example, a decision tree is built manually by following the previously described algorithm.
The first step is to determine the best attribute on which to split. In calculating gain, we use the classification error as the homogeneity index. We use this rather than the Gini index because it is simpler to compute and will allow you to follow the calculations by hand; a short code sketch after Table 4.1 mirrors the computation.
Table 4.1 Car Buyer Data

CarBuyer  Gender  MaritalStatus  SpeedingCitations
Y         M       S              5
Y         F       S              3
N         F       M              0
N         M       M              0
Y         M       S              3
N         F       M              1
Y         F       M              1
N         M       S              3
N         F       M              4
Y         M       M              3
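As an illustration of the gain calculation, the following Python sketch encodes Table 4.1 and computes the classification-error gain of a split on each categorical attribute. The data encoding and function names are assumptions made for this example.

from collections import Counter

# Table 4.1 encoded as (CarBuyer, Gender, MaritalStatus, SpeedingCitations).
DATA = [
    ("Y", "M", "S", 5), ("Y", "F", "S", 3), ("N", "F", "M", 0),
    ("N", "M", "M", 0), ("Y", "M", "S", 3), ("N", "F", "M", 1),
    ("Y", "F", "M", 1), ("N", "M", "S", 3), ("N", "F", "M", 4),
    ("Y", "M", "M", 3),
]

def classification_error(rows):
    # Homogeneity index: 1 minus the proportion of the majority class.
    counts = Counter(row[0] for row in rows)   # row[0] is CarBuyer
    return 1.0 - max(counts.values()) / len(rows)

def split_gain(rows, attr):
    # Gain = parent error minus the size-weighted error of the children.
    children = {}
    for row in rows:
        children.setdefault(row[attr], []).append(row)
    weighted = sum(len(c) / len(rows) * classification_error(c)
                   for c in children.values())
    return classification_error(rows) - weighted

for name, attr in [("Gender", 1), ("MaritalStatus", 2)]:
    print(name, round(split_gain(DATA, attr), 3))   # Gender 0.1, MaritalStatus 0.2

Run on this data, the split on MaritalStatus yields the larger gain (0.2 versus 0.1 for Gender). SpeedingCitations is numeric, so it would normally be split on a threshold value rather than on each distinct value.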
 