objective is to achieve the highest possible accuracy on new, unseen data. Accuracy
is the share of correct predictions in the total number of predictions.
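The definition of accuracy can be illustrated with a minimal Python sketch. The function name and the example labels (1 = loan repaid, 0 = default) are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = loan repaid, 0 = default.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]

# Three of the five predictions match the actual outcome.
print(accuracy(actual, predicted))
```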
Computational models are built and trained by data mining experts using histor-
ical data. The performance and properties of a model depend, among other factors,
on the historical data that has been used to train it. This section provides an over-
view of the computational modeling process and discusses the expected properties
of the historical data. The next section will discuss how these properties translate
into models that may result in biased decision making.
3.2.1 Modeling Assumptions
Computational models typically rely on two assumptions: (1) the characteristics
of the population will stay the same in the future when the model is applied,
and (2) the training data represents the population well. These assumptions are
known as the i.i.d. setting, which stands for independent and identically distributed
random variables (see e.g. Duda, Hart and Stork, 2001).
The first assumption is that the characteristics of the population from which the
training sample is collected are the same as the characteristics of the population on
which the model will be applied. If this assumption is violated, models may fail to
perform accurately (Kelly, Hand and Adams, 1999). For instance, the repayment
patterns of people working in the car manufacturing industry may be different at
times of economic boom as compared to times of economic crisis. A model
trained at times of boom may not be as accurate at times of crisis. Similarly, a model
trained on data collected in Brazil may not correctly predict the performance of
customers in Germany.
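The effect of violating the first assumption can be sketched with a toy example. Everything here is an illustrative assumption rather than real data or a method from the text: the "model" is a single income cut-off, and the two small datasets are invented so that the income needed to repay a loan is higher in the crisis period than in the boom period.

```python
def threshold_accuracy(rows, threshold):
    """Share of rows where the rule 'income >= threshold' matches the repaid flag."""
    hits = sum(1 for income, repaid in rows if (income >= threshold) == bool(repaid))
    return hits / len(rows)


def fit_threshold(rows):
    """Pick the income cut-off with the highest accuracy on the training rows."""
    candidates = sorted({income for income, _ in rows})
    return max(candidates, key=lambda t: threshold_accuracy(rows, t))


# Hypothetical (income, repaid) pairs; 1 = repaid, 0 = default.
boom = [(10, 0), (20, 0), (30, 1), (40, 1), (50, 1), (60, 1)]
# In the invented crisis data, incomes of 30 and 40 no longer suffice to repay.
crisis = [(10, 0), (20, 0), (30, 0), (40, 0), (50, 1), (60, 1)]

cutoff = fit_threshold(boom)          # learned at times of boom
acc_boom = threshold_accuracy(boom, cutoff)
acc_crisis = threshold_accuracy(crisis, cutoff)
print(cutoff, acc_boom, acc_crisis)
```

The cut-off that separates the boom data perfectly misclassifies the two middle-income customers in the crisis data, so accuracy drops even though the model itself has not changed.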
The second assumption is satisfied if our historical dataset closely resembles
the population of the applicants in the market. That means, for instance, that our
training set needs to have the same share of good and bad clients as the market,
the same distribution of ages as in the market, the same proportion of males and
females, and the same proportion of high-skilled and low-skilled labor. In short, the
second assumption implies that our historical database is a small copy of a large
population out there in the market. If the assumption is violated, then our training
data is incomplete and a model trained on such data may perform sub-optimally
(Zadrozny, 2004).
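One simple way to check the second assumption for a single attribute is to compare its shares in the training set with its shares in the market. The sketch below assumes that the market shares are known or can be estimated, which in practice is often not the case; the function names and the label lists are hypothetical.

```python
from collections import Counter


def shares(values):
    """Relative frequency of each distinct value in the list."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def share_gaps(sample, population):
    """Sample share minus population share, per value; 0 means a perfect match."""
    s, p = shares(sample), shares(population)
    return {value: s.get(value, 0.0) - p.get(value, 0.0)
            for value in sorted(set(s) | set(p))}


# Hypothetical data: the training set over-represents good clients.
training_labels = ["good"] * 9 + ["bad"] * 1
market_labels = ["good"] * 7 + ["bad"] * 3

gaps = share_gaps(training_labels, market_labels)
print(gaps)  # positive gap = over-represented in the training set
```

The same comparison can be repeated for age groups, gender, or skill level. Large gaps suggest that the training set is not a small copy of the market.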
The representation of the population in our database may be inaccurate in two
ways. Either the selection of people to be included may be biased or the selection
of attributes by which people are described in our database may be incomplete.
Suppose that a bank collects a dataset consisting only of people who live in a major
city. A model is trained on this data and then applied to all incoming
customers, including those who live in remote rural areas and have different
employment opportunities and spending habits. The model may not perform well
on the rural customers, since the training data contained only city customers.
Alternatively, suppose that a bank collects a representative sample of clients but does not
ask about the stability of their income, which is considered to be one of the
main factors in credit performance. Without this information the model will treat