These functions can all be easily estimated from the data-set by counting how
many times each attribute value occurs together with each class attribute value.
When we want to determine, for instance, the probability that a female with a
high school education receives a high income, we use the total probability
function to compute the probability of these values together with a high and
with a low income, and then normalize:
P(high income, female, high school) = 0.3 · 0.33 · 0.33 ≈ 0.033
P(low income, female, high school) = 0.7 · 0.43 · 0.57 ≈ 0.172
P(high income | female, high school)
  = P(high income, female, high school) / P(female, high school)
  = 0.033 / (0.033 + 0.172) ≈ 0.16
Since this is less than 0.5, we estimate that a female with a high school education
will not receive a high income. Note that this is estimated based on the assumption
that education and gender are independent given the income class.
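The computation above can be reproduced in a few lines. The following Python sketch is only an illustration: the dictionary names and the function `naive_bayes_posterior` are hypothetical, and the assignment of the factors 0.33/0.43 to gender and 0.33/0.57 to education is assumed from the order in which they appear in the products.

```python
# Probabilities quoted in the worked example. Which factor belongs to
# gender and which to education is an assumption based on their order.
prior = {"high": 0.3, "low": 0.7}          # P(income class)
p_female = {"high": 0.33, "low": 0.43}     # P(female | income class), assumed
p_high_school = {"high": 0.33, "low": 0.57}  # P(high school | income class), assumed


def naive_bayes_posterior(prior, likelihoods, target="high"):
    """Return (posterior of target class, joint probability per class).

    The joint for each class is the prior times the product of the
    per-attribute likelihoods, i.e. the attributes are treated as
    independent given the class.
    """
    joint = {}
    for cls, p in prior.items():
        for likelihood in likelihoods:
            p *= likelihood[cls]
        joint[cls] = p
    evidence = sum(joint.values())  # P(female, high school)
    return joint[target] / evidence, joint


posterior, joint = naive_bayes_posterior(prior, [p_female, p_high_school])
print(round(joint["high"], 3))  # 0.033
print(round(joint["low"], 3))   # 0.172
print(round(posterior, 2))      # 0.16
```

Dividing by the sum of the two joint probabilities is the normalization step: it replaces P(female, high school) by the total over both income classes.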
The above example describes the basic version of a Naive Bayes classifier. Most
implementations use Gaussian distributions for continuous attributes and smoothing
methods to avoid zero probabilities (Bishop, 2006). In addition, the decision
threshold (0.5 in the example) can often be modified. Although using a threshold
of 0.5 makes sense intuitively, it is common practice to modify it depending on the
situational needs, for instance to increase accuracy, or decrease the number of
false positives (Lachiche & Flach, 2003).
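As a minimal illustration of how the threshold changes the decision, the sketch below applies two thresholds to the posterior of 0.16 from the example. The function name `classify` and the alternative threshold of 0.15 are hypothetical and chosen only for illustration; treating a posterior exactly equal to the threshold as high income is likewise an assumption.

```python
def classify(posterior_high, threshold=0.5):
    """Label an individual from P(high income | attributes).

    Posteriors at or above the threshold are labelled "high"
    (the tie-breaking rule is an assumption of this sketch).
    """
    return "high" if posterior_high >= threshold else "low"


posterior_high = 0.16  # value from the worked example

default_label = classify(posterior_high)                  # threshold 0.5
lowered_label = classify(posterior_high, threshold=0.15)  # hypothetical threshold

print(default_label)  # low
print(lowered_label)  # high
```

Lowering the threshold labels more individuals as high income, which trades additional false positives for fewer false negatives; raising it has the opposite effect.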
14.3 The Problem of Discrimination in Data-Mining
Chapter 3 explains how discrimination may occur even if the training data is
non-discriminatory. In this section we show specifically how using an
off-the-shelf Naive Bayes classifier can lead to discriminatory results.
We motivate our methods using examples of the discriminatory results that are
obtained when using a Naive Bayes classifier² on the census income data-set³.
From this data-set we try to learn a Naive Bayes classifier that can be used to
decide whether a new individual should be classified as having a high or a low
income. Historically, this decision has been biased towards the male sex, as can
be seen in the following table:
Table 14.1 The contingency table of the income and gender attributes

          Low income   High income
Female          9592          1179
Male           15128          6662
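To make the imbalance in Table 14.1 concrete, the following Python sketch computes the fraction of each gender that has a high income directly from the counts in the table. The variable and function names are hypothetical; only the four counts come from the table.

```python
# Counts copied from Table 14.1.
counts = {
    "Female": {"low": 9592, "high": 1179},
    "Male": {"low": 15128, "high": 6662},
}


def high_income_rate(row):
    """Fraction of a group with a high income: high / (low + high)."""
    return row["high"] / (row["low"] + row["high"])


rates = {gender: high_income_rate(row) for gender, row in counts.items()}
for gender, rate in rates.items():
    print(f"P(high income | {gender}) = {rate:.3f}")

# Difference between the two conditional probabilities.
gap = rates["Male"] - rates["Female"]
print(f"difference = {gap:.3f}")
```

With these counts, roughly 11% of the females and roughly 31% of the males in the data-set have a high income, a difference of about 0.20.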
² We use the Naive Bayes classifier from the e1071 package in the R statistical
toolbox (Dimitriadou et al., 2008).
³ http://archive.ics.uci.edu/ml/datasets/Census+Income
 