Why Unbiased Computational Processes Can Lead to Discriminative Decision Procedures - Discrimination and Privacy in the Information Society


objective is to achieve the highest possible accuracy on new, unseen data. Accuracy is the share of correct predictions in the total number of predictions.
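
The definition of accuracy given above can be written out in a few lines. The following is only a minimal sketch: the function name `accuracy` and the example labels (1 for a repaid loan, 0 for a default) are illustrative assumptions, not taken from the chapter.

```python
def accuracy(y_true, y_pred):
    """Share of correct predictions in the total number of predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    # Count the positions where the predicted label equals the actual label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = loan repaid, 0 = default.
actual = [1, 0, 1, 1, 0, 1, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy(actual, predicted))  # 6 of 8 predictions match: 0.75
```

Note that this figure is only meaningful as an estimate of future performance when `actual` and `predicted` come from data the model has not seen during training.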

Computational models are built and trained by data mining experts using historical data. The performance and properties of a model depend, among other factors, on the historical data that has been used to train it. This section provides an overview of the computational modeling process and discusses the expected properties of the historical data. The next section will discuss how these properties translate into models that may result in biased decision making.

3.2.1 Modeling Assumptions

Computational models typically rely on the assumptions that (1) the characteristics of the population will stay the same in the future when the model is applied, and (2) the training data represents the population well. These assumptions are known as the i.i.d. setting, which stands for independent and identically distributed random variables (see e.g. Duda, Hart and Stork, 2001).

The first assumption is that the characteristics of the population from which the training sample is collected are the same as the characteristics of the population to which the model will be applied. If this assumption is violated, models may fail to perform accurately (Kelly, Hand and Adams, 1999). For instance, the repayment patterns of people working in the car manufacturing industry may differ between times of economic boom and times of economic crisis. A model trained in times of boom may be less accurate in times of crisis. Similarly, a model trained on data collected in Brazil may not correctly predict the performance of customers in Germany.
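
The boom-versus-crisis example can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the chapter's method: the single feature (a debt-to-income ratio), the one-threshold decision rule, and all data values are hypothetical and chosen by hand so that the effect is easy to see.

```python
def rule_accuracy(cut, ratios, repaid):
    """Accuracy of the rule 'predict repaid when ratio <= cut'."""
    hits = sum(1 for r, y in zip(ratios, repaid) if (r <= cut) == bool(y))
    return hits / len(ratios)


def fit_threshold(ratios, repaid):
    """Pick the cut-off with the highest accuracy on the training data."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(ratios)):
        acc = rule_accuracy(cut, ratios, repaid)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut


# Hypothetical debt-to-income ratios and outcomes (1 = repaid, 0 = default).
boom_ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
boom_repaid = [1, 1, 1, 1, 1, 0, 0, 0]    # in a boom, ratios up to 0.5 are repaid

crisis_ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
crisis_repaid = [1, 1, 0, 0, 0, 0, 0, 0]  # in a crisis, only ratios up to 0.2 are repaid

cut = fit_threshold(boom_ratios, boom_repaid)
print(cut)                                               # 0.5
print(rule_accuracy(cut, boom_ratios, boom_repaid))      # 1.0
print(rule_accuracy(cut, crisis_ratios, crisis_repaid))  # 0.625
```

The rule itself is unchanged between the two evaluations; only the relationship between the feature and the outcome has shifted, which is enough to lower accuracy.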

The second assumption is satisfied if our historical dataset closely resembles the population of applicants in the market. That means, for instance, that our training set needs to have the same share of good and bad clients as the market, the same distribution of ages, the same proportion of males and females, and the same proportion of high-skilled and low-skilled labor. In short, the second assumption implies that our historical database is a small copy of the large population out there in the market. If the assumption is violated, then our training data is incomplete and a model trained on such data may perform sub-optimally (Zadrozny, 2004).
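
One simple way to probe the second assumption is to compare the category shares in the training set with the shares known for the market. The sketch below is illustrative only: the helper names `shares` and `largest_share_gap`, the skill-level attribute, and the market figures are all hypothetical, and a small gap on one attribute does not by itself establish that the sample is representative.

```python
from collections import Counter


def shares(values):
    """Relative frequency of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def largest_share_gap(sample_values, population_shares):
    """Largest absolute difference between sample and population shares."""
    sample_shares = shares(sample_values)
    categories = set(sample_shares) | set(population_shares)
    return max(
        abs(sample_shares.get(c, 0.0) - population_shares.get(c, 0.0))
        for c in categories
    )


# Hypothetical training set: 7 high-skilled and 3 low-skilled applicants.
training_skill = ["high"] * 7 + ["low"] * 3
# Hypothetical market composition, assumed to be known from external statistics.
market_skill_shares = {"high": 0.4, "low": 0.6}

gap = largest_share_gap(training_skill, market_skill_shares)
print(round(gap, 2))  # 0.3
```

The same comparison could be repeated for each attribute mentioned above (client quality, age bands, gender), again assuming reliable market figures are available.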

The representation of the population in our database may be inaccurate in two ways. Either the selection of people to be included may be biased, or the selection of attributes by which people are described in our database may be incomplete. Suppose that a bank collects a dataset consisting only of people who live in a major city. A model is trained on this data and then applied to all incoming customers, including those who live in remote rural areas and have different employment opportunities and spending habits. The model may not perform well on the rural customers, since the training was forced to focus on the city customers. Or suppose that a bank collects a representative sample of clients, but does not ask about the stability of their income, which is considered to be one of the main factors in credit performance. Without this information the model will treat
