Data Labeling
Third, the historical data used for training a model contains the true labels,
which in certain cases may be incorrect and reflect prejudices. Labels are the
targets that an organization wants to predict for new incoming instances. The
true labels in the historical data may be objective or subjective. Labels are
objective when no human interpretation was involved in assigning them; such
labels are hard in the sense that different human observers cannot disagree
about their correctness. Examples of objective labels include indicators of
whether an existing bank customer repaid a credit or not, whether a suspect
was carrying a concealed weapon, or whether a driver tested positive or negative
for alcohol intoxication. Examples of subjective labels include a human
resource manager's assessment of whether a job candidate is suitable for a
particular job, the decision whether a client of a bank should get a loan, the
decision to accept or reject a university applicant, and the decision whether
or not to detain a suspect. For subjective labels there is a gray area in which
human judgment may have influenced the labeling, resulting in a bias in the
target attribute. In contrast to objective labels, here different observers may
disagree: different people may assess a job candidate or a student application
differently, so the notion of what the correct label is becomes fuzzy.
The distinction between subjective and objective labels is important in assess-
ing and preventing discrimination. Only the subjective labels can be incorrect due
to biased decision making in the historical data. For instance, if females have
been discriminated against in university admission, some labels in our database
stating whether a person should be admitted will be incorrect under the present
non-discriminatory regulations. Objective labels, on the other hand, will be
correct even if our database was collected in a biased manner. For instance, we
may choose to detain suspects selectively, but the resulting true label, whether
a given suspect actually carried a gun or not, is measurable and thus
objectively correct.
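
The university admission example can be made concrete with a small sketch. It
is illustrative only: the applicants, the scores, and both thresholds are
hypothetical and do not come from any real data set. The sketch assumes that
past admission officers required a higher score from female applicants, and
that the present regulations apply one threshold to everyone. The stored labels
faithfully record the past decisions, yet some of them are incorrect when
judged by the present rule.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    gender: str  # "F" or "M"
    score: int   # hypothetical admission test score, 0-100


def historical_decision(a: Applicant) -> bool:
    """Assumed biased past practice: females needed a higher score."""
    threshold = 70 if a.gender == "F" else 60
    return a.score >= threshold


def fair_decision(a: Applicant) -> bool:
    """Assumed present non-discriminatory rule: one threshold for all."""
    return a.score >= 60


# Hypothetical applicants; both groups have identical score distributions.
applicants = [
    Applicant("F", 75), Applicant("F", 65), Applicant("F", 62), Applicant("F", 50),
    Applicant("M", 75), Applicant("M", 65), Applicant("M", 62), Applicant("M", 50),
]

# Subjective labels as they would be stored in the historical database.
labels = [historical_decision(a) for a in applicants]

# Labels that are incorrect when judged by the present regulations.
incorrect = [a for a, y in zip(applicants, labels) if y != fair_decision(a)]


def admission_rate(gender: str) -> float:
    """Fraction of applicants of the given gender with a positive label."""
    group = [y for a, y in zip(applicants, labels) if a.gender == gender]
    return sum(group) / len(group)


print("incorrect labels:", incorrect)
print("admission rate F:", admission_rate("F"))
print("admission rate M:", admission_rate("M"))
```

A model trained to reproduce these labels as accurately as possible would also
reproduce the higher bar for female applicants, although nothing in the data
marks the affected labels as wrong.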
The computational modeling process requires an insightful analysis of the
origins and properties of the training data. Because of how the data
originated, computational models trained on it may rest on incorrect
assumptions and, as we will see in the next section, may lead to biased
decision making.
3.3 Types of Problems
In this section we discuss three scenarios that show how violating the
assumptions sketched in the previous section may affect the validity of models
learned from data and lead to discriminatory decision procedures. In all three
scenarios we explicitly assume that the only goal of data mining is to optimize
the accuracy of predictions, i.e., there is no incentive to discriminate based
on taste. Before we go into the scenarios, we first recall the important notion
of prediction accuracy and explain how we will assess the discrimination of a
classifier. Then we will deal with three scenarios illustrating the following
situations:
Labels are incorrect: due to historical discrimination the labels are biased. Even
though the labels accurately represent decisions of the past, for the future task