actually obtain one can lead to lawsuits. In fact, when looking at the data, it is obvious that the classifier discriminates against females: males have a probability of only 1051/(1051+2205) = 0.32 of being wrongfully denied a loan, while females have a probability of 319/(319+271) = 0.54. Using data mining tools unmodified for such decision support systems can thus be considered a very dangerous practice.
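All of the numbers in this section follow from the same simple arithmetic on a two-by-two contingency table: a rate per gender and the difference between the two rates. A minimal sketch of that computation (the function and variable names are ours, not the chapter's):

```python
def rate_gap(pos_f, neg_f, pos_m, neg_m):
    """Per-gender rate of the outcome of interest and the gap between the genders."""
    p_f = pos_f / (pos_f + neg_f)
    p_m = pos_m / (pos_m + neg_m)
    return p_f, p_m, p_f - p_m

# Wrongful denials among the high-income test cases, using the counts quoted above:
# females 319 denied vs. 271 accepted, males 1051 denied vs. 2205 accepted.
p_f, p_m, gap = rate_gap(319, 271, 1051, 2205)
print(f"female: {p_f:.2f}  male: {p_m:.2f}  gap: {gap:.2f}")
# -> female: 0.54  male: 0.32  gap: 0.22
```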
Removing sensitive information does not help
A commonly used method to avoid potential lawsuits is to not store any sensitive information such as gender. The idea is that learning a classifier on data without this type of information prevents the classifier's predictions from being based on the sensitive attribute. This approach, however, does not work. The reason is that there may be other attributes that are highly correlated with the sensitive attribute. In such a situation, the classifier will use these correlated attributes and thus discriminate indirectly. This phenomenon was termed the red-lining effect in Chapter 3. In the banking example, for instance, job occupation is correlated with gender. Removing gender will therefore help only a little, as job occupation can be used as a predictor for this attribute. For example, when we learn a Naive Bayes classifier on the census income data-set without gender information 4 and test it on the test-set with the modified threshold, we obtain the following table:
Table 14.7 The gender-predicted income contingency table for the test-set, assigned by a Naive Bayes classifier learned without gender information

            Low income    High income
Female            4900            521
Male              7474           3386
This table shows positive class probabilities of 521/(4900+521) = 0.10 for females and 3386/(7474+3386) = 0.31 for males, and thus a discrimination of 0.21. This is not much of an improvement over the classifier that used the gender information directly. In fact, the false negatives show the same problem as before:
Table 14.8 The gender-predicted income contingency table for the high-income test cases, assigned by the corrected classifier learned without gender information

            Low income    High income
Female             301            289
Male              1079           2177
Thus, even learning a classifier on a data-set without sensitive information can be dangerous. Removing the sensitive information from a data-set actually makes the situation worse: data-mining tools will still discriminate, but in a much more concealed way, and rectifying this situation with discrimination-aware techniques is extremely difficult when the sensitive information is not available.
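As an illustration, the following is a minimal sketch of the kind of experiment described above: a Naive Bayes classifier is learned with the sensitive attribute removed from the feature set, and the remaining discrimination is measured on the test-set. The file names, column names, label strings, the use of a Bernoulli Naive Bayes over one-hot encoded attributes, and the omission of the threshold correction and of the "wife"/"husband" substitution are simplifications for the sake of the sketch, not the exact setup used in this chapter.

```python
# Sketch: learn Naive Bayes with the sensitive attribute removed, then measure
# the remaining discrimination. File/column names and labels are assumptions.
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("adult_train.csv")   # hypothetical file names
test = pd.read_csv("adult_test.csv")

sensitive, target = "sex", "income"
features = [c for c in train.columns if c not in (sensitive, target)]

# Bernoulli Naive Bayes over one-hot encoded attributes as a stand-in for the
# chapter's Naive Bayes classifier; the sensitive attribute is NOT a feature.
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    BernoulliNB(),
)
clf.fit(train[features].astype(str), train[target] == ">50K")
pred_high = clf.predict(test[features].astype(str))

# Discrimination = P(predicted high income | male) - P(predicted high income | female).
# Even without gender as a feature, correlated attributes such as occupation
# act as proxies (the red-lining effect), so the gap does not simply vanish.
p_male = pred_high[(test[sensitive] == "Male").to_numpy()].mean()
p_female = pred_high[(test[sensitive] == "Female").to_numpy()].mean()
print(f"discrimination: {p_male - p_female:.2f}")
```

Even though gender is absent from the feature set, the predicted positive rates can still be broken down per gender, as in Table 14.7, to quantify the remaining indirect discrimination.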
4 In addition, we replaced “wife” by “husband” in the relationship attribute.
 