accuracy. In addition, the expectation maximization algorithm has problems converging to a good-quality solution with zero discrimination. In fact, during later iterations it often finds solutions that are worse, both in terms of discrimination and accuracy, than solutions found earlier. This surprising behavior of the EM algorithm still needs further investigation. For a more detailed overview and discussion of these results, the reader is referred to (Calders & Verwer, 2010).
14.5 A Note on Positive Discrimination
Although discrimination-aware data-mining is, in our opinion, necessary, one should be aware that it not only decreases the accuracy of data-mining, but is also likely to introduce positive discrimination. For instance, if we repeat the final analysis from Section 3 on the results obtained using our first threshold-modifying method (applied until zero discrimination) on the census income data-set, we obtain the following counts for the people who should receive a high income according to the test-set:
Table 14.9 The gender vs. predicted income contingency table for high-income test cases, assigned by a Naive Bayes classifier with modified decision thresholds

             Low income    High income
Female              101            489
Male               1763           1493
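From this table we can read off, per gender, the probability of being falsely denied a high income. The following minimal Python sketch illustrates the computation (the counts are taken from Table 14.9; the variable names are ours):

# Per-gender probability of being falsely denied a high income,
# computed from the counts in Table 14.9.
female_low, female_high = 101, 489   # high-income females predicted low / high
male_low, male_high = 1763, 1493     # high-income males predicted low / high

female_denial_rate = female_low / (female_low + female_high)  # 101/590  ~ 0.17
male_denial_rate = male_low / (male_low + male_high)          # 1763/3256 ~ 0.54

print(f"female: {female_denial_rate:.2f}, male: {male_denial_rate:.2f}")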
Suddenly, females have a much smaller probability of being falsely denied a high income. This is an example of positive discrimination, and in some countries this type of discrimination is also considered illegal. These numbers, however, are determined using the discriminatory labels in the test-set. The actual difference in false negatives will be smaller when the true, non-discriminatory class values are used. Unfortunately, since we do not know who is being discriminated against, we cannot know exactly how to correct these numbers for this discrimination. We can, however, make an educated guess based on the assumptions that discrimination occurs at random and that the total number of positives should remain intact.
Under these assumptions, 690 females with a negative class label in the test-set should actually have a positive label, and 690 males with a positive label should actually have a negative label. The probability that such a female is already assigned a positive label equals the false positive probability, which is 0.1683 (813 out of 4018). Thus, 690 · 0.1683 ≈ 116 discriminated females already receive a positive label, and 574 discriminated females remain. Since all of these should get a positive label, these counts are added to the true positives and the false negatives, respectively. For the male counts, some positives should actually be negatives. The false negative probability for males is 0.5415 (1763 out of 3256). Thus, 690 · 0.5415 ≈ 374 favored males get a negative label, and 316 favored males remain. Since these counts should actually be negative, we subtract them from the counts in the table.
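This correction arithmetic can be reproduced with a short Python sketch, assuming, as stated above, that discrimination occurs at random and that the number of positives remains intact. The variable names are ours, the counts and probabilities are those given in the text, and the final corrected cells are our reading of where the female counts are added and the male counts subtracted:

# Correction under the assumptions stated above: discrimination occurs at
# random and the total number of positives remains intact.
n_flipped = 690  # labels to flip in each direction under these assumptions

# Females: labeled low income, but should be labeled high income.
fp_rate_female = 0.1683                                 # P(predicted high | labeled low)
already_positive = round(n_flipped * fp_rate_female)    # 116 already predicted high
remaining_discriminated = n_flipped - already_positive  # 574 still predicted low

# Males: labeled high income, but should be labeled low income.
fn_rate_male = 1763 / 3256                              # 0.5415, P(predicted low | labeled high)
already_negative = round(n_flipped * fn_rate_male)      # 374 already predicted low
remaining_favored = n_flipped - already_negative        # 316 still predicted high

# Corrected cells of Table 14.9 under this reading (our reconstruction):
female_low, female_high = 101 + remaining_discriminated, 489 + already_positive  # 675, 605
male_low, male_high = 1763 - already_negative, 1493 - remaining_favored          # 1389, 1177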
 