On the other hand, when specific statistical assumptions are met, traditional
statistical techniques can yield comparable or even better results than decision
trees. Moreover, in decision trees, the model is represented by a set of rules, the
number of which may be quite large. This can make the model harder to understand,
particularly in the case of complex, multilevel partitions. In traditional
statistical techniques (such as logistic regression), the input-output association is
represented by one or a few overall equations, whose coefficients
denote the effect of each predictor on the output.
ONE GOAL, DIFFERENT DECISION TREE ALGORITHMS: C&RT, C5.0,
AND CHAID
There are various decision tree algorithms with different tree growth methods.
All of them have the same goal of maximizing the total purity by identifying
sub-segments dominated by a specific outcome. However, they differ according to
the measure they use for selecting the optimal split.
Classification and Regression Trees (C&RT) produce splits of two child nodes,
also referred to as binary splits. They typically incorporate an impurity measure
named Gini for the splits. The Gini coefficient is a measure of dispersion that
depends on the distribution of the outcome categories. It takes values between 0 and 1.
It reaches its maximum value (worst case), which is 1 - 1/k for k outcome categories,
when the outcome categories are evenly balanced, and its minimum value of 0 (best case)
when all records of a node are concentrated in a single category.
The Gini Impurity Measure Used in C&RT
The formula for the Gini measure used in C&RT models is as follows:
$$\mathrm{Gini} = 1 - \sum_{i} P(t_i)^2$$
where $P(t_i)$ is the proportion of cases in node $t$ that are in output category $i$.
In the case of an output field with three categories, a node with a
balanced outcome distribution of records (1/3, 1/3, 1/3) has a Gini value of
0.667. On the contrary, a pure node with all records assigned to a single
category and a distribution of (1, 0, 0) gets a Gini value of 0.
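The two values above can be checked with a short sketch. The function names `gini` and `split_gini` are illustrative assumptions and do not come from any particular C&RT implementation; `split_gini` shows the size-weighted average that is applied to the child nodes of a split.

```python
def gini(proportions):
    """Gini impurity of a node: 1 minus the sum of squared category proportions."""
    return 1.0 - sum(p * p for p in proportions)


def split_gini(child_sizes, child_ginis):
    """Size-weighted average of the child nodes' Gini values (hypothetical helper)."""
    total = sum(child_sizes)
    return sum(n / total * g for n, g in zip(child_sizes, child_ginis))


balanced = gini([1 / 3, 1 / 3, 1 / 3])  # three evenly balanced categories
pure = gini([1, 0, 0])                  # all records in a single category

print(round(balanced, 3))  # 0.667
print(round(pure, 3))      # 0.0
```

With three categories the balanced node reaches 1 - 1/3, the largest value possible for that number of categories, while the pure node scores 0.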
The Gini impurity measure for a specific split is the weighted average
of the Gini measures of the resulting child nodes, with each node weighted
by its share of the records. Consequently, a split which results in two
nodes of equal size with respective Gini measures of 0.4 and 0.2 has a
total Gini value of 0.5*0.4 + 0.5*0.2 = 0.3. At each branch, all predictors