Data Mining Techniques for Segmentation - Data Mining Techniques in CRM: Inside Customer Segmentation

On the other hand, when specific statistical assumptions are met, traditional

statistical techniques can yield comparable or even better results than decision

trees. Moreover, in decision trees, the model is represented by a set of rules, the

number of which may be quite large. This fact may complicate understanding of the

model, particularly in the case of complex and multilevel partitions. In traditional

statistical techniques (such as logistic regression), the association between inputs and output is

represented by one or a few overall equations, whose coefficients

denote the effect of each predictor on the output.

ONE GOAL, DIFFERENT DECISION TREE ALGORITHMS: C&RT, C5.0, AND CHAID

There are various decision tree algorithms with different tree growth methods.

All of them have the same goal of maximizing the total purity by identifying

sub-segments dominated by a specific outcome. However, they differ according to

the measure they use for selecting the optimal split.

Classification and Regression Trees (C&RT) produce splits of two child nodes,

also referred to as binary splits. They typically incorporate an impurity measure

named Gini for the splits. The Gini coefficient is a measure of dispersion that

depends on the distribution of the outcome categories. It ranges from 0 to 1 and

has a maximum value (worst case) in the case of balanced distributions of the

outcome categories and a minimum value (best case) when all records of a node

are concentrated in a single category.

The Gini Impurity Measure Used in C&RT

The formula for the Gini measure used in C&RT models is as follows:

$$\text{Gini} = 1 - \sum_{i} P(t_i)^2$$

where P(t_i) is the proportion of cases in node t that are in output category i.

In the case of an output field with three categories, a node with a

balanced outcome distribution of records (1/3, 1/3, 1/3) has a Gini value of

0.667. On the contrary, a pure node with all records assigned to a single

category and a distribution of (1, 0, 0) gets a Gini value of 0.
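The two values quoted above can be checked with a minimal Python sketch of the node-level formula. The function name `gini_impurity` and the sample label lists are illustrative assumptions, not part of any C&RT implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum_i P(t_i)^2 for the outcome labels of one node.

    `labels` is a hypothetical list holding the outcome category of
    every record that falls into the node.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("node has no records")
    counts = Counter(labels)
    # P(t_i) is the share of the node's records in category i.
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Balanced three-category node: distribution (1/3, 1/3, 1/3).
balanced = ["a", "b", "c"] * 10
# Pure node: distribution (1, 0, 0).
pure = ["a"] * 30

print(round(gini_impurity(balanced), 3))  # 0.667
print(gini_impurity(pure))                # 0.0
```

With k equally represented categories the measure reaches 1 - 1/k, which is why the balanced three-category node gives 0.667 rather than 1.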

The Gini impurity measure for a specific split is the weighted average

of the resulting child nodes. Consequently, a split which results in two

nodes of equal size with respective Gini measures of 0.4 and 0.2 has a

total Gini value of 0.5*0.4 + 0.5*0.2 = 0.3. At each branch, all predictors
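The weighted average for a split can be sketched in the same way. This is a minimal illustration under the assumption that each child node is summarized by its record count and its own Gini value; the name `split_gini` and the record counts of 50 are hypothetical.

```python
def split_gini(child_sizes, child_ginis):
    """Return the size-weighted average of the child nodes' Gini values.

    child_sizes: number of records in each child node (assumed inputs).
    child_ginis: Gini value of each child node, in the same order.
    """
    if len(child_sizes) != len(child_ginis):
        raise ValueError("need one Gini value per child node")
    total = sum(child_sizes)
    if total == 0:
        raise ValueError("split has no records")
    # Each child contributes in proportion to its share of the records.
    return sum(size / total * gini
               for size, gini in zip(child_sizes, child_ginis))


# Two children of equal size with Gini values 0.4 and 0.2.
print(round(split_gini([50, 50], [0.4, 0.2]), 3))  # 0.3
```

Because lower values mean purer child nodes, a Gini-based comparison of candidate splits would favor the one with the smallest weighted value.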
