are evaluated, and the predictor that results in the maximum impurity
reduction, or equivalently the greatest purity improvement, is selected for
partitioning.
The C5.0 algorithm can produce more than two subgroups at each split,
offering non-binary splits. Candidate splits are evaluated using the information
gain ratio, an information-theory measure.
The Information Gain Measure Used in C5.0
Although a thorough explanation of information gain is beyond the
scope of this topic, we will outline its main concepts. Information
represents the number of bits needed to describe the outcome category of a
particular node. It depends on the probabilities (proportions) of the
outcome classes, and its unit, the bit, can be thought of as one of the
simple yes/no questions needed to determine the outcome category.
The formula for information is as follows:

Information(t) = − Σ_i P(t_i) log2(P(t_i))

where P(t_i) is the proportion of cases in node t that are in output category i.
So, if, for example, a node contains three equally balanced category
outcomes, the node information would be 3 × [(1/3) log2(1/(1/3))] = log2(3),
or 1.58 bits.

The information of a split is simply the weighted average of the infor-
mation of the child nodes. Information gained by partitioning the data based
on a selected predictor X is measured by:

Information Gain (due to split on X) = Information(parent node) − Information(after splitting on X).
The C5.0 algorithm chooses for the split the predictor that yields
the maximum information gain. In fact, it uses a normalized form of
the information gain, the information gain ratio, which also corrects a bias in
previous versions of the algorithm toward large and bushy trees.
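The quantities above can be sketched in a few lines of code. The following is a minimal illustration, not C5.0's actual implementation: `information` computes node entropy in bits, `information_gain` subtracts the weighted child entropy from the parent entropy, and `gain_ratio` normalizes the gain by the split information (the entropy of the child-node sizes), the normalization used by C4.5/C5.0. All function names are ours for illustration.

```python
from collections import Counter
from math import log2

def information(labels):
    """Bits needed to describe the outcome category at a node:
    -sum_i P(t_i) * log2(P(t_i)) over the outcome classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent information minus the weighted average information
    of the child nodes produced by a candidate split."""
    n = len(parent)
    after_split = sum(len(g) / n * information(g) for g in children)
    return information(parent) - after_split

def gain_ratio(parent, children):
    """Information gain normalized by the split information
    (the entropy of the child-node sizes)."""
    n = len(parent)
    split_info = -sum((len(g) / n) * log2(len(g) / n) for g in children if g)
    return information_gain(parent, children) / split_info if split_info else 0.0
```

For the example in the text, `information(["a", "b", "c"])` returns log2(3), about 1.58 bits; a split that perfectly separates a balanced two-class node into two pure children has an information gain of 1.0 bit.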
Both C5.0 and C&RT tend to produce bushy trees. That is why they incorporate
an integrated pruning procedure for producing smaller trees of equivalent