Table 12.3  Gini Index for different possible splits of the data from Table 12.2

Condition            left branch       right branch      Gini Index
                     # pos   # neg     # pos   # neg
sex = m                4       1         3       2          0.4
ethnicity = native     5       1         1       3          0.32
diploma = none         1       1         5       3          0.48
...                   ...     ...       ...     ...         ...
order to predict the Class. Initially, we start with a tree consisting of only one node,
predicting the majority class '+'. Then, iteratively, we refine the tree by considering
all possible splitting criteria, and evaluating which split is the best. Selecting the
best split is done by observing how the split condition separates the positive class
from the negative class. A split that is better at separating the classes will score
higher on the quality measure. For the dataset of Table 12.1, the different splits are
as follows: The split sex = m would divide the dataset into those instances that satisfy the condition (the left branch), including 4 positive and 1 negative instance, and those instances that do not satisfy the condition (the right branch), having 3 negative and 2 positive examples. Based on these figures, a degree of impurity can be computed, in this case based upon the Gini index (Lerman & Yitzhaki, 1984):
to compute the Gini-index of a split, we first separate the dataset according to the split criterion. For each partition, the relative frequencies of the positive and the negative class, $f_+$ and $f_-$ respectively, are counted. The Gini-index is then the weighted average of the Gini-score $1 - (f_+^2 + f_-^2)$ of the partitions. If a partition is pure, this implies that either $f_+ = 1$ and $f_- = 0$, or $f_+ = 0$ and $f_- = 1$. In both cases, the partition contributes $1 - (f_+^2 + f_-^2) = 0$ to the Gini-score of the split. The contribution of a partition is the highest if it is maximally impure, i.e., $f_+ = f_- = 0.5$, in which case $1 - (f_+^2 + f_-^2) = 0.5$. For the example split sex = m, the partition containing the males contributes $1 - ((1/5)^2 + (4/5)^2) = 8/25$, while the partition with the females contributes $1 - ((2/5)^2 + (3/5)^2) = 12/25$. The Gini-index for the split is now the weighted average over the two partitions, being $0.5 \cdot (8/25) + 0.5 \cdot (12/25) = 10/25 = 0.4$.
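To make the computation concrete, the following sketch (our own illustration, not from the chapter; the function names gini_score and gini_of_split are ours) computes the Gini-index of a binary split directly from the four counts listed in Table 12.3:

```python
def gini_score(pos, neg):
    """Gini-score 1 - (f_+^2 + f_-^2) of a single partition."""
    n = pos + neg
    if n == 0:
        return 0.0
    f_pos, f_neg = pos / n, neg / n
    return 1.0 - (f_pos ** 2 + f_neg ** 2)

def gini_of_split(l_pos, l_neg, r_pos, r_neg):
    """Gini-index of a split: weighted average of the partition Gini-scores."""
    n = l_pos + l_neg + r_pos + r_neg
    w_left = (l_pos + l_neg) / n
    w_right = (r_pos + r_neg) / n
    return w_left * gini_score(l_pos, l_neg) + w_right * gini_score(r_pos, r_neg)

# Reproduces the entries of Table 12.3:
print(gini_of_split(4, 1, 3, 2))   # sex = m            -> 0.4
print(gini_of_split(5, 1, 1, 3))   # ethnicity = native -> 0.3166... (0.32)
print(gini_of_split(1, 1, 5, 3))   # diploma = none     -> 0.475    (0.48)
```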
The better the split separates positive from negative examples, the lower the impurity. From all splits, the one with the lowest impurity is selected. The dataset is then split into two parts according to the splitting criterion, and the procedure continues on both parts until a stopping condition is met.
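As an illustration of this greedy procedure, here is a minimal self-contained sketch (our own; the chapter gives no code, and representing instances as dictionaries sharing the same attributes is an assumption):

```python
from collections import Counter

def gini(labels):
    """Gini-score of one partition, generalised to a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Greedy top-down induction on a non-empty dataset: start from a
    majority-class leaf and refine with the lowest-Gini-index split."""
    # Candidate conditions "attribute = value" that put at least one
    # instance in each branch (assumes all rows share the same attributes).
    conds = {(a, v) for row in rows for a, v in row.items()
             if any(r[a] != v for r in rows)}
    if depth == max_depth or len(set(labels)) <= 1 or not conds:
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class

    def gini_index(cond):
        a, v = cond
        left = [y for x, y in zip(rows, labels) if x[a] == v]
        right = [y for x, y in zip(rows, labels) if x[a] != v]
        n = len(labels)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    a, v = min(conds, key=gini_index)                  # best (lowest) split
    li = [i for i, x in enumerate(rows) if x[a] == v]
    ri = [i for i, x in enumerate(rows) if x[a] != v]
    return ((a, v),
            build_tree([rows[i] for i in li], [labels[i] for i in li], depth + 1, max_depth),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri], depth + 1, max_depth))
```

Because candidate conditions are restricted to those that place at least one instance in each branch, every recursive call shrinks the data and the recursion always terminates.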
In (Kamiran et al., 2010b, 2010a) we show how the splitting criterion can be changed in such a way that not only the impurity with respect to the class label is incorporated, but also the level of discrimination introduced by the split. In particular, we compute not only how well the split predicts the class label, but also how well it predicts the sensitive attribute, using the same Gini-index, but now with the relative frequencies of the deprived and favored communities in the partitions of the split. A good split will then be one that achieves a high purity with respect to the class label, but a low purity with respect to the sensitive attribute. In the running example this means that we want splits that are good at distinguishing high-income from low-income people, without separating the deprived community from the favored one.
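This passage does not spell out how the two Gini-indices are combined into a single criterion (the cited papers evaluate concrete combinations); as one plausible reading, a split could be scored by the class impurity minus the sensitive-attribute impurity, reusing gini_of_split from the sketch above:

```python
def discrimination_aware_score(l_pos, l_neg, r_pos, r_neg,
                               l_dep, l_fav, r_dep, r_fav):
    """Hypothetical combined criterion (lower is better): Gini-index w.r.t.
    the class label minus Gini-index w.r.t. the sensitive attribute.
    A split that separates + from - well (low first term) while mixing the
    deprived and favored communities (high second term) scores lowest."""
    gini_class = gini_of_split(l_pos, l_neg, r_pos, r_neg)
    gini_sensitive = gini_of_split(l_dep, l_fav, r_dep, r_fav)
    return gini_class - gini_sensitive
```

Here l_dep and l_fav count the deprived- and favored-community members in the left branch (analogously for the right branch), and the split minimising this score would be preferred.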