they are no longer appropriate. Reasons could be, e.g., explicit discrimination, or a change in labeling practice over time. This corresponds to assumption 1 of Section 4.2.1 being violated.
The sampling procedure is biased: the labels are correct and unbiased, but particular groups are under- or overrepresented in the data, leading to incorrect inferences during classifier induction. This corresponds to assumption 2 (first principled way) of Section 4.2.1 being violated.
The data is incomplete; there are hidden attributes: often not all attributes that determine the label are monitored, for example for reasons of privacy or simply because they are difficult to observe. In such a situation it may happen that sensitive attributes are used as a proxy and indirectly lead to discriminatory models, as sketched below. This corresponds to assumption 2 (second principled way) of Section 4.2.1 being violated.
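As a rough illustration of this last point, the following sketch (the attribute names, distributions, and the choice of a decision tree are all made up for the example) trains a model on data in which the attribute that actually determines repayment is never recorded, so the model falls back on a correlated sensitive attribute:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 20_000
gender = rng.integers(0, 2, n)                 # sensitive attribute, 0 or 1
income = rng.normal(45 + 10 * gender, 10, n)   # determines the label, but is hidden
age = rng.integers(20, 65, n)                  # observed, irrelevant here
repay = (income > 50).astype(int)              # correct, unbiased labels

X_observed = np.column_stack([gender, age])    # income is NOT among the recorded attributes
clf = DecisionTreeClassifier(max_depth=3).fit(X_observed, repay)

# Two persons identical in every observed attribute except gender:
print(clf.predict([[0, 40], [1, 40]]))         # typically differs, e.g. [0 1]
```

The different predictions for two persons who agree on every observed attribute arise purely because gender acts as a stand-in for the unobserved income.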
3.3.1 Accuracy and Discrimination
Suppose that the task is to learn a classifier that divides new bank customers into two groups: likely to repay and unlikely to repay. Based on historical data of existing customers and whether or not they repaid their loans, we learn a classifier. A classifier is a mathematical model that allows us to extrapolate from observable attributes such as gender, age, profession, education, income, address, and outstanding loans to make predictions. Recall that the accuracy of a classifier learned on such data is defined as the percentage of predictions of the classifier that are correct. To assess this key performance measure before actually deploying the model in practice, usually some labeled data (i.e., instances for which we already know the outcome) is used that has been put aside for this purpose and not used during the learning process.
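A minimal sketch of this train/test procedure follows; the file name, attribute names, and the choice of a decision tree are merely illustrative assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("customers.csv")               # hypothetical historical, labeled records
X = data[["age", "income", "outstanding_loans"]]  # observable attributes
y = data["repaid"]                                # known outcome (the label)

# Put part of the labeled data aside; it is not used during learning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)

# Accuracy: the fraction of held-out predictions that are correct.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```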
Our analysis is based upon the following two assumptions about the classification process.
Assumption 1: The classifier learning process is aimed only at obtaining as high an accuracy as possible. No other objective is pursued during the data mining phase.
Assumption 2: A classifier discriminates with respect to a sensitive attribute, e.g., gender, if for two persons who differ only in their gender (and possibly in some characteristics irrelevant to the classification problem at hand) that classifier predicts different labels.
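In its strictest form (the two persons agree on everything except the sensitive attribute), this definition can be probed directly for a trained model. The sketch below assumes a fitted classifier clf, a pandas data frame X of persons, and a 0/1-encoded gender column; none of these names come from the text above:

```python
import pandas as pd

def discriminates(clf, X: pd.DataFrame, sensitive: str = "gender") -> float:
    """Fraction of persons whose predicted label changes when only the
    sensitive attribute is altered, all other attributes kept fixed."""
    flipped = X.copy()
    flipped[sensitive] = 1 - flipped[sensitive]   # swap the 0/1 encoding
    return (clf.predict(X) != clf.predict(flipped)).mean()

# e.g. discriminates(clf, X_test) > 0 means some pairs of otherwise
# identical persons receive different labels.
```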
Note that the two persons in assumption 2 only need to agree on relevant characteristics. Otherwise one could easily circumvent the definition by claiming that a person was not discriminated against based on gender, but instead because she was wearing a skirt. Although people “wearing a skirt” do not constitute a protected-by-law subpopulation, using such an attribute would be unacceptable given its high correlation with gender and the fact that characteristics such as “wearing a skirt” are considered irrelevant for credit scoring. Often, however, it is far less obvious how to separate relevant from irrelevant attributes. For instance, in a mortgage application an address may at the same time be important to assess the intrinsic value of a property,