target attribute changes between the training data and the new data to which the
learned model is applied, i.e. the dependence on the attribute gender decreases.
Such background knowledge may encourage an analyst to apply discrimination-aware techniques that try to learn the part of the relation between the demographic features and the income that is independent of the gender of that person. In this way the analyst kills two birds with one stone: the classifier will be less discriminatory and at the same time more accurate.
3.3.3 Scenario 2: Sampling Bias
In this scenario training data may be biased, i.e. some groups of individuals may
be over- or underrepresented, even though the labels themselves are correct. As
we will show, such a sample bias may lead to biased decisions.
Let us consider the following example of over- and underrepresented groups in studies. To reduce the number of car accidents, the police increase the number of alcohol checks in a particular area. It is generally accepted that young drivers cause more accidents than older drivers; for example, a study by Jonah (1986) confirms that young (16-25) drivers (a) are at greater risk of being involved in a casualty accident than older drivers and (b) this greater risk is primarily a function of their propensity to take risks while driving. Because of that, the police often specifically target this group of young drivers in their checks. People in the category “over 40” are checked only sporadically, when there is a strong incentive or suspicion of intoxication. After the campaign, it is decided to analyze the data in order to find specific groups in society that are particularly prone to alcohol abuse in traffic. A classification model is learned on the data to predict, given the age, ethnicity, social class, car type, and gender, whether a person is more or less likely to drive while intoxicated. Since the labels are known only for those people who were actually checked, only this data is used in the study. Due to the data collection procedure there is a clear sample bias in the training data: only those people who were checked are in the dataset, and this is not a representative sample of all people who participate in traffic. Analysis of this dataset could surprisingly conclude that women over 40 in particular represent a danger of being intoxicated while driving. Such a finding is explained by the fact that, according to the examples presented to the classifier, middle-aged women are intoxicated more often than average. A factor that was disregarded in this analysis, however, is that middle-aged women were checked by the police only when there was a serious suspicion of intoxication. Even though in this example it is obvious what went wrong in the analysis, sample bias is a very common and hard-to-solve problem. Think, e.g., of medical studies involving only people who exhibit certain symptoms, or of telephone surveys conducted only among people whose phone number appears on the list used by the marketing bureau. Depending on the source of the list, which may have been purchased from other companies, particular groups may be over- or underrepresented.
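The effect described above can be made concrete with a small simulation. The sketch below is hypothetical (the group names, check probabilities, and the 5% intoxication rate are assumptions, not figures from the study): both groups are equally likely to be intoxicated, but one group is checked routinely while the other is checked almost only on strong suspicion. The biased sample then makes the sporadically checked group look far more dangerous than it is.

```python
import random

random.seed(0)

TRUE_RATE = 0.05  # assumed: the same true intoxication rate for every group


def simulate_population(n=100_000):
    """Generate drivers whose intoxication is independent of their group."""
    population = []
    for _ in range(n):
        group = random.choice(["young_driver", "woman_over_40"])
        intoxicated = random.random() < TRUE_RATE
        population.append((group, intoxicated))
    return population


def apply_checking_policy(population):
    """Biased data collection: young drivers are checked routinely,
    women over 40 almost only when there is a suspicion, i.e. almost
    only when they are actually intoxicated."""
    sample = []
    for group, intoxicated in population:
        if group == "young_driver":
            p_check = 0.5                           # routine checks
        else:
            p_check = 0.9 if intoxicated else 0.01  # suspicion-driven checks
        if random.random() < p_check:
            sample.append((group, intoxicated))
    return sample


def intoxication_rate(data, group):
    flags = [i for g, i in data if g == group]
    return sum(flags) / len(flags)


pop = simulate_population()
sample = apply_checking_policy(pop)

# In the full population both groups sit near the same 5% rate, but in
# the checked sample women over 40 appear intoxicated far more often.
print("population rate, women over 40:", intoxication_rate(pop, "woman_over_40"))
print("sample rate, women over 40:   ", intoxication_rate(sample, "woman_over_40"))
print("sample rate, young drivers:   ", intoxication_rate(sample, "young_driver"))
```

Any model trained on `sample` instead of `pop` inherits this distortion: it sees a dataset in which the label frequency for one group is an artifact of the checking policy, not of driving behavior.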