actually obtain one can lead to lawsuits. In fact, when looking at the data, it is obvious that the classifier discriminates against females: males have a probability of only 1051/(1051+2205) = 0.32 of being wrongfully denied a loan, while females have a probability of 319/(319+271) = 0.54. Using data mining tools unmodified for such decision support systems can thus be considered a very dangerous practice.
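All of the numbers in this section follow from the same simple arithmetic on a two-by-two contingency table: a rate per gender and the difference between the two rates. A minimal sketch of that computation (the function and variable names are ours, not the chapter's):

```python
def rate_gap(pos_f, neg_f, pos_m, neg_m):
    """Per-gender rate of the outcome of interest and the gap between the genders."""
    p_f = pos_f / (pos_f + neg_f)
    p_m = pos_m / (pos_m + neg_m)
    return p_f, p_m, p_f - p_m

# Wrongful denials among the high-income test cases, using the counts quoted above:
# females 319 denied vs. 271 accepted, males 1051 denied vs. 2205 accepted.
p_f, p_m, gap = rate_gap(319, 271, 1051, 2205)
print(f"female: {p_f:.2f}  male: {p_m:.2f}  gap: {gap:.2f}")
# -> female: 0.54  male: 0.32  gap: 0.22
```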
Removing sensitive information does not help
A commonly used method to avoid potential lawsuits is to not store any sensitive information such as gender. The idea is that learning a classifier on data without this type of information prevents the classifier's predictions from being based on the sensitive attribute. This approach, however, does not work. The reason is that there may be other attributes that are highly correlated with the sensitive attribute. In such a situation, the classifier will use these correlated attributes and thus discriminate indirectly. This phenomenon was termed the red-lining effect in Chapter 3. In the banking example, for instance, job occupation is correlated with gender. Removing gender will therefore help only a little, as job occupation can be used as a predictor for this attribute. For example, when we learn a Naive Bayes classifier on the census income data-set without gender information 4 and test it on the test-set with the modified threshold, we obtain the following table:
Table 14.7 The gender-predicted income contingency table for the test-set, assigned by a Naive Bayes classifier learned without gender information

            Low income    High income
Female            4900            521
Male              7474           3386
This table shows positive class probabilities of 521/(4900+521) = 0.10 for females and 3386/(7474+3386) = 0.31 for males, and thus a discrimination of 0.21. This is not much of an improvement over the classifier that used the gender information directly. In fact, the false negatives show the same problem as before:
Table 14.8 The gender-predicted income contingency table for the high-income test cases, assigned by the corrected classifier learned without gender information

            Low income    High income
Female             301            289
Male              1079           2177
Thus, even learning a classifier on a data-set without sensitive information can be dangerous. Removing the sensitive information from a data-set actually makes the situation worse: data-mining tools will still discriminate, but in a much more concealed way, and rectifying this situation with discrimination-aware techniques is extremely difficult when the sensitive information is not available.
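As an illustration, the following is a minimal sketch of the kind of experiment described above: a Naive Bayes classifier is learned with the sensitive attribute removed from the feature set, and the remaining discrimination is measured on the test-set. The file names, column names, label strings, the use of a Bernoulli Naive Bayes over one-hot encoded attributes, and the omission of the threshold correction and of the "wife"/"husband" substitution are simplifications for the sake of the sketch, not the exact setup used in this chapter.

```python
# Sketch: learn Naive Bayes with the sensitive attribute removed, then measure
# the remaining discrimination. File/column names and labels are assumptions.
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("adult_train.csv")   # hypothetical file names
test = pd.read_csv("adult_test.csv")

sensitive, target = "sex", "income"
features = [c for c in train.columns if c not in (sensitive, target)]

# Bernoulli Naive Bayes over one-hot encoded attributes as a stand-in for the
# chapter's Naive Bayes classifier; the sensitive attribute is NOT a feature.
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    BernoulliNB(),
)
clf.fit(train[features].astype(str), train[target] == ">50K")
pred_high = clf.predict(test[features].astype(str))

# Discrimination = P(predicted high income | male) - P(predicted high income | female).
# Even without gender as a feature, correlated attributes such as occupation
# act as proxies (the red-lining effect), so the gap does not simply vanish.
p_male = pred_high[(test[sensitive] == "Male").to_numpy()].mean()
p_female = pred_high[(test[sensitive] == "Female").to_numpy()].mean()
print(f"discrimination: {p_male - p_female:.2f}")
```

Even though gender is absent from the feature set, the predicted positive rates can still be broken down per gender, as in Table 14.7, to quantify the remaining indirect discrimination.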
4 In addition, we replaced “wife” by “husband” in the relationship attribute.
 