are evaluated, and the predictor that results in the maximum impurity
reduction, or equivalently the greatest purity improvement, is selected for
partitioning.
The C5.0 algorithm can produce more than two subgroups at each split,
offering non-binary splits. Candidate splits are evaluated using the information
gain ratio, an information-theory measure.
The Information Gain Measure Used in C5.0
Although a thorough explanation of information gain is beyond the
scope of this topic, we will outline its main concepts. Information
represents the number of bits needed to describe the outcome category of a
particular node. It depends on the probabilities (proportions) of the
outcome classes, and its unit, the bit, can be thought of as one of the
simple yes/no questions needed to determine the outcome category.
The formula for information is as follows:

Information(t) = − Σ_i P(t_i) log2(P(t_i))

where P(t_i) is the proportion of cases in node t that are in output category i.
So, if, for example, a node contains three equally balanced category
outcomes, the node information would be 3 × [(1/3) log2(1/(1/3))] = log2(3),
or 1.58 bits.

The information of a split is simply the weighted average of the infor-
mation of the child nodes. Information gained by partitioning the data based
on a selected predictor X is measured by:

Information Gain (due to split on X) = Information(parent node) − Information(after splitting on X).
The C5.0 algorithm chooses for the split the predictor that yields
the maximum information gain. In fact, it uses a normalized form of
the information gain, the information gain ratio, which also corrects a bias in
previous versions of the algorithm toward large and bushy trees.
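The quantities above can be sketched in a few lines of code. The following is a minimal illustration, not C5.0's actual implementation: `information` computes node entropy in bits, `information_gain` subtracts the weighted child entropy from the parent entropy, and `gain_ratio` normalizes the gain by the split information (the entropy of the child-node sizes), the normalization used by C4.5/C5.0. All function names are ours for illustration.

```python
from collections import Counter
from math import log2

def information(labels):
    """Bits needed to describe the outcome category at a node:
    -sum_i P(t_i) * log2(P(t_i)) over the outcome classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent information minus the weighted average information
    of the child nodes produced by a candidate split."""
    n = len(parent)
    after_split = sum(len(g) / n * information(g) for g in children)
    return information(parent) - after_split

def gain_ratio(parent, children):
    """Information gain normalized by the split information
    (the entropy of the child-node sizes)."""
    n = len(parent)
    split_info = -sum((len(g) / n) * log2(len(g) / n) for g in children if g)
    return information_gain(parent, children) / split_info if split_info else 0.0
```

For the example in the text, `information(["a", "b", "c"])` returns log2(3), about 1.58 bits; a split that perfectly separates a balanced two-class node into two pure children has an information gain of 1.0 bit.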
Both C5.0 and C&RT tend to produce bushy trees. That is why they incorporate
an integrated pruning procedure for producing smaller trees of equivalent