Database Reference
In-Depth Information
Table 12.3 Gini Index for different possible splits of the data from Table 12.2
Condition
left branch right branch Gini Index
# pos #neg # pos #neg
sex=m
4
1
3
2
0.4
ethnicity=native
5
1
1
3
0.32
diploma=none
1
1
5
3
0.48
...
...
...
...
...
...
order to predict the Class . Initially, we start with a tree consisting of only one node,
predicting the majority class '+'. Then, iteratively, we refine the tree by considering
all possible splitting criteria, and evaluating which split is the best. Selecting the
best split is done by observing how the split condition separates the positive class
from the negative class. A split that is better at separating the classes will score
higher on the quality measure. For the dataset of Table 12.1, the different splits are
as follows: The split sex
m would divide the dataset into those instances that sat-
isfy the condition (the left branch), including 4 positive and 1 negative instance,
and those instances that do not satisfy the condition (the right branch), having 3
negative and 2 positive examples. Based on these figures, a degree of impurity can
be computed, in this case, based upon the Gini index (Lerman & Yitzhaki, 1984):
to compute the Gini-index of a split, we first separate the dataset according to the
split criterion. For each partition, the relative frequencies of the positive and nega-
tive class, f + and f respectively, are counted. The Gini-index is then the weighted
average of the Gini-score 1
=
f 2
f 2
(
+ +
)
. If a partition is pure, this implies that either
1and f 2
0and f 2
f + =
=
0, or f + =
=
1. In both cases, the partition contributes
f 2
f 2
1
0 to the gini-score of the split. The contribution of a partition is the
highest if it is maximally impure; i.e., f + =
(
+ +
)=
f 2
=
0
.
5. For the example split sex
=
m ,
2
2
the partition containing the males contributes 1
((
1
/
5
)
+(
4
/
5
)
)=
8
/
25, while
2
2
the partition with the females contributes 1
25. The Gini-
index for the split is now the weighted average over the two partitions, being:
0
((
2
/
5
)
+(
3
/
5
)
)=
12
/
4.
The better the split separates positive from negative, the lower the impurity. From
all splits the one with the lowest impurity is selected. The dataset is split in two parts,
according to the splitting criterion and the procedure continues on both parts until
a stopping condition is met. In (Kamiran et al., 2010b, 2010a) we show how the
splitting criterion can be changed in such a way that not only the impurity with
respect to the class label can be incorporated, but also the level of discrimination
introduced by the split. In particular, we do not only compute how good the split
predicts the class label, but also how good it predicts the sensitive attribute, using the
same gini-index, but now with the relative frequencies of the deprived and favored
communities in the partitions of the split. The good split will then be the one that
achieves a high purity with respect to the class label, but a low purity with respect to
the sensitive attribute. In the running example this means that we want splits that are
good for distinguishing high income from low income people, without separating
.
5
(
8
/
25
)+
0
.
5
(
12
/
25
)=
10
/
25
=
0
.                                          Search WWH ::

Custom Search