Introducing Positive Discrimination in Predictive Models - Discrimination and Privacy in the Information Society


These functions can all be estimated from the data set by counting how often each attribute value occurs together with each class value. To determine, for instance, the probability that a female with a high school education receives a high income, we use the total probability function: we compute the probability of these attribute values together with a high and with a low income, and then normalize:

P(high income, female, high school) = 0.3 · 0.33 · 0.33 = 0.033

P(low income, female, high school) = 0.7 · 0.43 · 0.57 = 0.172

P(high income | female, high school) = P(high income, female, high school) / P(female, high school) = 0.033 / (0.033 + 0.172) = 0.16

Since this is less than 0.5, we estimate that a female with a high school education will not receive a high income. Note that this estimate relies on the assumption that education and gender are independent given the income class.
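The arithmetic of this example can be reproduced in a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `naive_bayes_posterior` is hypothetical, and reading 0.43 as P(female | low income) and 0.57 as P(high school | low income) is an assumption based on the order of the factors in the example (the product is the same either way).

```python
def naive_bayes_posterior(prior_pos, likelihoods_pos, prior_neg, likelihoods_neg):
    """Return P(positive class | attributes) under the Naive Bayes assumption.

    Each joint probability is the class prior multiplied by the per-attribute
    conditional probabilities; the posterior is obtained by normalizing over
    the two classes.
    """
    joint_pos = prior_pos
    for p in likelihoods_pos:
        joint_pos *= p
    joint_neg = prior_neg
    for p in likelihoods_neg:
        joint_neg *= p
    return joint_pos / (joint_pos + joint_neg)


# Values taken from the worked example (female, high school education).
posterior = naive_bayes_posterior(
    prior_pos=0.3, likelihoods_pos=[0.33, 0.33],  # high income
    prior_neg=0.7, likelihoods_neg=[0.43, 0.57],  # low income
)
predicted_high_income = posterior >= 0.5

print(round(posterior, 2), predicted_high_income)  # prints: 0.16 False
```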

The above example describes the basic version of a Naive Bayes classifier. Most implementations use Gaussian distributions for continuous attributes and smoothing methods to avoid zero probabilities (Bishop, 2006). In addition, the decision threshold (0.5 in the example) can often be modified. Although a threshold of 0.5 makes sense intuitively, it is common practice to adjust it to the needs of the situation, for instance to increase accuracy or to decrease the number of false positives (Lachiche & Flach, 2003).
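To make the roles of counting, smoothing and the decision threshold concrete, the following sketch estimates the conditional probabilities by counting and classifies with an adjustable threshold. Everything here is an assumption made for illustration: the function names `estimate_conditionals` and `predict` are hypothetical, add-alpha (Laplace) smoothing is only one common smoothing choice and not necessarily the one used by any particular implementation, and the six training rows are invented toy data, not the census data.

```python
from collections import Counter


def estimate_conditionals(rows, labels, alpha=1.0):
    """Estimate class priors and P(attribute value | class) by counting.

    Add-alpha smoothing keeps every conditional estimate strictly positive,
    so an unseen (value, class) pair does not zero out the whole product.
    """
    class_counts = Counter(labels)
    pair_counts = Counter()
    values = [set() for _ in rows[0]]  # distinct values seen per attribute
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            pair_counts[(i, v, label)] += 1
            values[i].add(v)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    cond = {}
    for i, vals in enumerate(values):
        for v in vals:
            for c, n in class_counts.items():
                cond[(i, v, c)] = (pair_counts[(i, v, c)] + alpha) / (n + alpha * len(vals))
    return priors, cond


def predict(row, priors, cond, positive, threshold=0.5):
    """Return (posterior of the positive class, decision at the threshold)."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for i, v in enumerate(row):
            p *= cond[(i, v, c)]
        joint[c] = p
    posterior = joint[positive] / sum(joint.values())
    return posterior, posterior >= threshold


# Hypothetical toy data: (gender, education) -> income class.
rows = [
    ("female", "high school"), ("female", "university"),
    ("male", "high school"), ("male", "university"),
    ("male", "university"), ("female", "high school"),
]
labels = ["low", "low", "low", "high", "high", "low"]

priors, cond = estimate_conditionals(rows, labels)
example = ("male", "university")
posterior, default_decision = predict(example, priors, cond, positive="high")
_, strict_decision = predict(example, priors, cond, positive="high", threshold=0.8)
print(default_decision, strict_decision)
```

On this toy data the example is classified as high income at the default threshold of 0.5, while raising the threshold to 0.8 flips the decision. A stricter threshold of this kind trades missed positives for fewer false positives.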

14.3 The Problem of Discrimination in Data-Mining

Chapter 3 explains how discrimination may occur even if the training data is non-discriminatory. In this section we show specifically how using an off-the-shelf Naive Bayes classifier can lead to discriminatory results.

We motivate our methods using examples of the discriminatory results that are obtained when using a Naive Bayes classifier² on the census income data set³. From this data set we try to learn a Naive Bayes classifier that can be used to decide whether a new individual should be classified as having a high or a low income. Historically, this decision has been biased towards the male sex, as can be seen in the following table:

Table 14.1 The contingency table of the income and gender attributes

            Low income    High income
Female      9592          1179
Male        15128         6662
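The bias in Table 14.1 becomes explicit once the counts are converted into the fraction of each gender with a high income. The sketch below does only that arithmetic. The dictionary layout and the function name `high_income_rate` are hypothetical, and using the difference between the two rates as a summary of the gap is an assumption made here for illustration.

```python
# Counts copied from Table 14.1 (rows: gender, columns: income class).
counts = {
    "female": {"low": 9592, "high": 1179},
    "male": {"low": 15128, "high": 6662},
}


def high_income_rate(group):
    """Fraction of the given group that has a high income in the data."""
    row = counts[group]
    return row["high"] / (row["low"] + row["high"])


female_rate = high_income_rate("female")
male_rate = high_income_rate("male")
gap = male_rate - female_rate

# prints: female: 0.109, male: 0.306, gap: 0.196
print(f"female: {female_rate:.3f}, male: {male_rate:.3f}, gap: {gap:.3f}")
```

Roughly 11% of the females in the table have a high income, against roughly 31% of the males.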

2 We use the Naive Bayes classifier from the e1071 package in the R statistical toolbox (Dimitriadou et al., 2008).

3 http://archive.ics.uci.edu/ml/datasets/Census+Income
