12.1 Introduction
Classifier construction is one of the most popular data mining and machine learning
techniques (see also Chapter 2 of this book). We assume that a training set in which
labels are assigned to the instances is given. The labels indicate the class the train-
ing examples belong to, and will hence often be called the class labels. The training
examples are represented by tuples over a set of attributes; that is, every example
will be described by values for the same set of attributes. The attribute containing
the label will be called the class attribute. The label of an example is hence its value
for the class attribute. Table 12.1 gives an example training set. Every example
corresponds to a person and is described by the attributes gender, ethnicity, highest
degree, and job type, and by the class attribute determining whether this person
belongs to the class of people with a high income (label '+') or a low income (label
'-'). A classifier construction algorithm learns a predictive model for labeling new,
unlabeled data. For the given example, a classifier construction algorithm would
learn a model for predicting if a person has a high income or not, based upon this
person's gender, ethnicity, degree, and job type. Many algorithms for learning various
classes of classification models have been proposed over the past decades.
The quality of a classifier is measured by its predictive accuracy when classifying
previously unseen examples. To assess the accuracy of a classifier, usually a labeled
test set is used: test samples from which the label has been removed are classified by
the model, and the predicted labels are compared to the true labels.
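To make this concrete, the following is a minimal sketch of classifier construction and accuracy assessment, assuming a Python environment with pandas and scikit-learn. The tiny dataset is hypothetical; it only mimics the attributes of Table 12.1 and is not taken from it.

```python
# Minimal sketch: learn a classifier from labeled examples and assess its
# accuracy on a held-out test set. The data below is invented for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training examples: descriptive attributes plus the class
# attribute 'income' (labels '+' and '-').
data = pd.DataFrame({
    "gender":    ["m", "m", "f", "f", "m", "f", "m", "f"],
    "ethnicity": ["native", "native", "native", "foreign",
                  "native", "native", "foreign", "foreign"],
    "degree":    ["univ", "high school", "univ", "none",
                  "univ", "high school", "none", "univ"],
    "job_type":  ["board", "education", "education", "unemployed",
                  "board", "education", "unemployed", "education"],
    "income":    ["+", "+", "-", "-", "+", "+", "-", "-"],
})

X, y = data.drop(columns="income"), data["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Learn a predictive model from the labeled training examples.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      DecisionTreeClassifier(random_state=0))
model.fit(X_train, y_train)

# Assess accuracy: hide the test labels, predict them, and compare.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```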
For the vast majority of these classification techniques, maximizing accuracy is
the only objective; i.e., when the classifier is applied to new data, the percentage
of correctly labeled instances should be as high as possible. As explained in detail
in Chapter 3 of this book, however, blindly optimizing for high accuracy may
lead to undesirable side-effects such as discriminatory classifiers. In this chapter we
study the following fictitious case: a bank wants to attract new, preferably rich
customers. For this purpose, the dataset of its current clients shown in Table 12.1
is gathered and labeled according to their income. On the basis of this dataset, a
classifier is learnt and applied to the profiles of some prospective clients. If the classifier pre-
dicts that the candidate has a high income, a special promotion will be offered to
him or her. Such promotional schemes targeting particularly profitable groups are
not uncommon in commercial settings. In the dataset of Table 12.1, however, we
can clearly observe that the positive label is strongly correlated with males and with
native people. As a result, the promotional scheme will mainly benefit the group of
native males, potentially leading to ethical and legal issues. We will use this scenario
as a running example.
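The correlation between the positive label and the sensitive attributes can be made concrete with a simple rate comparison. The sketch below is illustrative only: it reuses the hypothetical data frame from the previous sketch (not the actual contents of Table 12.1) and computes, for each sensitive attribute, the difference in positive-label rates between the favored group and the rest.

```python
# Illustrative check of the correlation described above, using the
# hypothetical 'data' frame defined in the previous sketch.
def positive_rate_gap(df, sensitive, favored, label_col="income", pos="+"):
    """Difference between the positive-label rate of the favored group
    and that of the deprived group (everyone else)."""
    favored_mask = df[sensitive] == favored
    p_favored = (df.loc[favored_mask, label_col] == pos).mean()
    p_deprived = (df.loc[~favored_mask, label_col] == pos).mean()
    return p_favored - p_deprived

print("gender gap (m vs. f):        ", positive_rate_gap(data, "gender", "m"))
print("ethnicity gap (native vs. rest):", positive_rate_gap(data, "ethnicity", "native"))
```

A positive gap means that the favored group receives the '+' label more often in the data, which is exactly the pattern the promotional scheme would reproduce.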
In this chapter, we concentrate on the very specific case in which the input data
for training a classifier can be discriminatory, for instance due to historical discrimination
in decision making, and in which it is either forbidden by law or ethically unacceptable
for a classifier to learn and apply this discrimination to new instances. We
assume that the class label that needs to be predicted can take two values: + and −.
Furthermore, there is only one sensitive attribute S that can take two values: one for
the deprived community (f for "female") and one for the favored community (m for
"male").