Table 2.2 Historical data and model-generated prediction fields.
The first five columns are input fields, Response to pilot campaign is the output field, and the last two columns are model-generated fields.

| Customer ID | Gender | Profession | Monthly average number of SMS calls | Monthly average number of voice calls | Response to pilot campaign | Predicted response | Estimated response confidence score |
|---|---|---|---|---|---|---|---|
| 1 | Male | White collar | 28 | 140 | No | No | 0.0 |
| 2 | Male | Blue collar | 32 | 54 | No | No | 0.0 |
| 3 | Female | Blue collar | 57 | 30 | No | No | 0.0 |
| 4 | Male | White collar | 143 | 140 | Yes | Yes | 1.0 |
| 5 | Female | White collar | 87 | 81 | No | No | 0.0 |
| 6 | Male | Blue collar | 143 | 28 | No | No | 0.0 |
| 7 | Female | White collar | 150 | 140 | No | No | 0.0 |
| 8 | Male | White collar | 140 | 60 | Yes | Yes | 1.0 |
This comparison provides an estimate of the model's future predictive accuracy on unseen cases. To make this estimate more reliable, it is advisable to evaluate the model on a dataset that was not used to train it. This is achieved by partitioning the historical dataset into two distinct parts through random sampling: the training dataset and the testing dataset. A common practice is to allocate approximately 70-75% of the cases to the training dataset. Evaluation procedures are applied to both datasets, but analysts should focus mainly on the performance indicators in the testing dataset. A model that underperforms in the testing dataset should be re-examined, since this is a typical sign of overfitting: memorizing the specific training data. Models with this behavior do not provide generalizable results. They provide solutions that only work for the particular data on which they were trained.
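As a minimal sketch of this partitioning step, the split below uses scikit-learn's train_test_split on a pandas DataFrame. The file name campaign_history.csv and the column name response are illustrative assumptions, not names taken from the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name: any table with the fields of Table 2.2 would do.
history = pd.read_csv("campaign_history.csv")

# Random partition into training (~75% of cases) and testing datasets.
# Stratifying on the target keeps the Yes/No mix similar in both parts.
train_df, test_df = train_test_split(
    history,
    train_size=0.75,               # common 70-75% training allocation
    random_state=42,               # fixed seed makes the sampling reproducible
    stratify=history["response"],  # assumed name of the target field
)

print(len(train_df), "training cases,", len(test_df), "testing cases")
```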
Some analysts use the testing dataset to refine the model parameters and leave a third part of the data, namely the validation dataset, for evaluation. However, the best approach, which unfortunately is not always employed, is to test the model's performance on a third, disjoint dataset drawn from a different time period.
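Continuing the sketch above, one way to carve out this third part is simply to split twice; the 60/20/20 proportions here are one illustrative choice, not a prescription from the text.

```python
from sklearn.model_selection import train_test_split

# Set aside the part reserved for evaluation first (the validation dataset,
# in this section's terminology), then split the remainder into the training
# dataset and the testing dataset used for parameter refinement.
rest, valid_df = train_test_split(history, test_size=0.20, random_state=42)
train_df, test_df = train_test_split(rest, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% is 20% overall, giving a 60/20/20 split.
```

A dataset from a different time period would instead be loaded separately and scored with the finished model, since random sampling from a single period cannot produce it.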
One of the most common performance indicators for classification models is the error rate: the percentage of misclassified records. The overall error rate indicates the percentage of all records that the model did not classify correctly. Since some mistakes may be more costly than others, this percentage is also estimated separately for each category of the target field. Error rates are summarized in misclassification matrices, also called coincidence or confusion matrices, which have the form given in Table 2.3.
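A minimal sketch of these computations, using made-up actual and predicted labels (with two deliberate misclassifications) rather than real campaign data:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted responses with two misclassifications.
actual    = ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes"]
predicted = ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"]

# Rows correspond to actual categories, columns to predicted ones.
cm = confusion_matrix(actual, predicted, labels=["No", "Yes"])

# Overall error rate: share of off-diagonal (misclassified) records.
overall_error = (cm.sum() - cm.diagonal().sum()) / cm.sum()   # 2/8 = 0.25

# Per-category error rate: misclassified share within each actual class.
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)          # [0.20, 0.33]
print(cm, overall_error, per_class_error)
```

The per-category rates make the asymmetry visible: here one of five actual "No" records and one of three actual "Yes" records are misclassified, which matters when the two kinds of mistakes carry different costs.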