12.1 Introduction
Classifier construction is one of the most popular data mining and machine learning
techniques (see also Chapter 2 of this book). We assume that a training set in which
labels are assigned to the instances is given. The labels indicate the class the train-
ing examples belong to, and will hence often be called the class labels. The training
examples are represented by tuples over a set of attributes; that is, every example
will be described by values for the same set of attributes. The attribute containing
the label will be called the class attribute. The label of an example is hence its value
for the class attribute. Table 12.1 gives an example training set. Every example
corresponds to a person and is described by the attributes gender, ethnicity, highest
degree, and job type, and by the class attribute determining whether this person
belongs to the class of people with a high income (label '+') or a low income (label
'-'). A classifier construction algorithm learns a predictive model for labeling new,
unlabeled data. For the given example, a classifier construction algorithm would
learn a model for predicting if a person has a high income or not, based upon this
person's gender, ethnicity, degree, and job type. Many algorithms for learning various
classes of classification models have been proposed over the past decades.
The quality of a classifier is measured by its predictive accuracy when classifying
previously unseen examples. To assess the accuracy of a classifier, usually a labeled
test set is used: test samples from which the label has been removed are classified by
the model, and the predicted labels are compared to the true labels.
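To make this concrete, the following is a minimal sketch of classifier construction and accuracy assessment, assuming a Python environment with pandas and scikit-learn. The tiny dataset is hypothetical; it only mimics the attributes of Table 12.1 and is not taken from it.

```python
# Minimal sketch: learn a classifier from labeled examples and assess its
# accuracy on a held-out test set. The data below is invented for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training examples: descriptive attributes plus the class
# attribute 'income' (labels '+' and '-').
data = pd.DataFrame({
    "gender":    ["m", "m", "f", "f", "m", "f", "m", "f"],
    "ethnicity": ["native", "native", "native", "foreign",
                  "native", "native", "foreign", "foreign"],
    "degree":    ["univ", "high school", "univ", "none",
                  "univ", "high school", "none", "univ"],
    "job_type":  ["board", "education", "education", "unemployed",
                  "board", "education", "unemployed", "education"],
    "income":    ["+", "+", "-", "-", "+", "+", "-", "-"],
})

X, y = data.drop(columns="income"), data["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Learn a predictive model from the labeled training examples.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      DecisionTreeClassifier(random_state=0))
model.fit(X_train, y_train)

# Assess accuracy: hide the test labels, predict them, and compare.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```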
For the vast majority of these classification techniques, maximizing accuracy is
the only objective; i.e., when the classifier is applied to new data, the percentage
of correctly labeled instances should be as high as possible. As explained in detail
in Chapter 3 of this book, however, blindly optimizing for high accuracy may
lead to undesirable side-effects such as discriminatory classifiers. In this chapter we
study the following fictitious case: a bank wants to attract new, preferably rich
customers. For this purpose, the dataset of its current clients shown in Table 12.1
is gathered and labeled according to their income. On the basis of this dataset, a
classifier is learnt and applied to the profiles of some prospective clients. If the classifier pre-
dicts that the candidate has a high income, a special promotion will be offered to
him or her. Such promotional schemes targeting particularly profitable groups are
not uncommon in commercial settings. In the dataset of Table 12.1, however, we
can clearly observe that the positive label is strongly correlated with males and with
native people. As a result, the promotional scheme will mainly benefit the group of
native males, potentially leading to ethical and legal issues. We will use this scenario
as a running example.
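The correlation between the positive label and the sensitive attributes can be made concrete with a simple rate comparison. The sketch below is illustrative only: it reuses the hypothetical data frame from the previous sketch (not the actual contents of Table 12.1) and computes, for each sensitive attribute, the difference in positive-label rates between the favored group and the rest.

```python
# Illustrative check of the correlation described above, using the
# hypothetical 'data' frame defined in the previous sketch.
def positive_rate_gap(df, sensitive, favored, label_col="income", pos="+"):
    """Difference between the positive-label rate of the favored group
    and that of the deprived group (everyone else)."""
    favored_mask = df[sensitive] == favored
    p_favored = (df.loc[favored_mask, label_col] == pos).mean()
    p_deprived = (df.loc[~favored_mask, label_col] == pos).mean()
    return p_favored - p_deprived

print("gender gap (m vs. f):        ", positive_rate_gap(data, "gender", "m"))
print("ethnicity gap (native vs. rest):", positive_rate_gap(data, "ethnicity", "native"))
```

A positive gap means that the favored group receives the '+' label more often in the data, which is exactly the pattern the promotional scheme would reproduce.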
In this chapter, we concentrate on the very specific case in which the input data
for training a classifier can be discriminatory, for instance due to historical discrimination
in decision making, and in which it is either forbidden by law or ethically unacceptable
for a classifier to learn and apply this discrimination to new instances. We
assume that the class label that needs to be predicted can take two values: + and −.
Furthermore, there is only one sensitive attribute S that can take two values: one for
the deprived community (f for "female") and one for the favored community (m for
"male").