target attribute changes between the training data and the new data to which the
learned model is applied, i.e. the dependence on the attribute gender decreases.
Such background knowledge may encourage an analyst to apply discrimination-aware techniques that try to learn the part of the relation between the demographic features and the income that is independent of the gender of that person. In this way the analyst kills two birds with one stone: the classifier will be less discriminatory and at the same time more accurate.
3.3.3 Scenario 2: Sampling Bias
In this scenario training data may be biased, i.e. some groups of individuals may
be over- or underrepresented, even though the labels themselves are correct. As
we will show, such a sample bias may lead to biased decisions.
Let us consider the following example of over- and underrepresented groups in studies. To reduce the number of car accidents, the police increase the number of alcohol checks in a particular area. It is generally accepted that young drivers cause more accidents than older drivers; for example, a study by Jonah (1986) confirms that young (16-25) drivers (a) are at greater risk of being involved in a casualty accident than older drivers and (b) this greater risk is primarily a function of their propensity to take risks while driving. Because of that, the police often specifically target this group of young drivers in their checks. People in the category “over 40” are checked only sporadically, when there is a strong incentive or suspicion of intoxication. After the campaign, it is decided to analyze the data in order to find specific groups in society that are particularly prone to alcohol abuse in traffic. A classification model is learned on the data to predict, given the age, ethnicity, social class, car type, and gender, whether a person is more or less likely to drive while intoxicated. Since the labels are known only for those people who were actually checked, only this data is used in the study. Due to the data collection procedure there is a clear sample bias in the training data: only those people who were checked are in the dataset, and this is not a representative sample of all people who participate in traffic. Analysis of this dataset could surprisingly conclude that women over 40 in particular represent a danger of being intoxicated while driving. Such a finding is explained by the fact that, according to the examples presented to the classifier, middle-aged women are intoxicated more often than average. A factor that was disregarded in this analysis, however, is that middle-aged women were checked by the police only when there was a serious suspicion of intoxication. Even though in this example it is obvious what went wrong in the analysis, sample bias is a very common and hard-to-solve problem. Think, e.g., of medical studies involving only people who exhibit certain symptoms, or of telephone surveys conducted only among people whose phone number appears on the list used by the marketing bureau. Depending on the source of the list, which may have been purchased from other companies, particular groups may be over- or underrepresented.
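The effect described above can be made concrete with a small simulation. The sketch below is hypothetical (the group names, check probabilities, and the 5% intoxication rate are assumptions, not figures from the study): both groups are equally likely to be intoxicated, but one group is checked routinely while the other is checked almost only on strong suspicion. The biased sample then makes the sporadically checked group look far more dangerous than it is.

```python
import random

random.seed(0)

TRUE_RATE = 0.05  # assumed: the same true intoxication rate for every group


def simulate_population(n=100_000):
    """Generate drivers whose intoxication is independent of their group."""
    population = []
    for _ in range(n):
        group = random.choice(["young_driver", "woman_over_40"])
        intoxicated = random.random() < TRUE_RATE
        population.append((group, intoxicated))
    return population


def apply_checking_policy(population):
    """Biased data collection: young drivers are checked routinely,
    women over 40 almost only when there is a suspicion, i.e. almost
    only when they are actually intoxicated."""
    sample = []
    for group, intoxicated in population:
        if group == "young_driver":
            p_check = 0.5                           # routine checks
        else:
            p_check = 0.9 if intoxicated else 0.01  # suspicion-driven checks
        if random.random() < p_check:
            sample.append((group, intoxicated))
    return sample


def intoxication_rate(data, group):
    flags = [i for g, i in data if g == group]
    return sum(flags) / len(flags)


pop = simulate_population()
sample = apply_checking_policy(pop)

# In the full population both groups sit near the same 5% rate, but in
# the checked sample women over 40 appear intoxicated far more often.
print("population rate, women over 40:", intoxication_rate(pop, "woman_over_40"))
print("sample rate, women over 40:   ", intoxication_rate(sample, "woman_over_40"))
print("sample rate, young drivers:   ", intoxication_rate(sample, "young_driver"))
```

Any model trained on `sample` instead of `pop` inherits this distortion: it sees a dataset in which the label frequency for one group is an artifact of the checking policy, not of driving behavior.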