Table 2.2 Historical data and model-generated prediction fields.
The first five columns are input fields, Response to pilot campaign is the output field, and the last two columns are model-generated fields.

| Customer ID | Gender | Profession | Monthly average number of SMS calls | Monthly average number of voice calls | Response to pilot campaign | Predicted response | Estimated response confidence score |
|---|---|---|---|---|---|---|---|
| 1 | Male | White collar | 28 | 140 | No | No | 0.0 |
| 2 | Male | Blue collar | 32 | 54 | No | No | 0.0 |
| 3 | Female | Blue collar | 57 | 30 | No | No | 0.0 |
| 4 | Male | White collar | 143 | 140 | Yes | Yes | 1.0 |
| 5 | Female | White collar | 87 | 81 | No | No | 0.0 |
| 6 | Male | Blue collar | 143 | 28 | No | No | 0.0 |
| 7 | Female | White collar | 150 | 140 | No | No | 0.0 |
| 8 | Male | White collar | 140 | 60 | Yes | Yes | 1.0 |
This comparison provides an estimate of the model's future predictive accuracy on unseen cases. To make this estimate more reliable, it is advisable to evaluate the model on a dataset that was not used to train it. This is achieved by partitioning the historical dataset into two distinct parts through random sampling: the training dataset and the testing dataset. A common practice is to allocate approximately 70-75% of the cases to the training dataset. Evaluation procedures are applied to both datasets, but analysts should focus mainly on the performance indicators in the testing dataset. A model that underperforms in the testing dataset should be re-examined, since this is a typical sign of overfitting: memorizing the specific training data. Models with this behavior do not provide generalizable results. They provide solutions that only work for the particular data on which they were trained.
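As a minimal sketch of this partitioning step, the split below uses scikit-learn's train_test_split on a pandas DataFrame. The file name campaign_history.csv and the column name response are illustrative assumptions, not names taken from the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name: any table with the fields of Table 2.2 would do.
history = pd.read_csv("campaign_history.csv")

# Random partition into training (~75% of cases) and testing datasets.
# Stratifying on the target keeps the Yes/No mix similar in both parts.
train_df, test_df = train_test_split(
    history,
    train_size=0.75,               # common 70-75% training allocation
    random_state=42,               # fixed seed makes the sampling reproducible
    stratify=history["response"],  # assumed name of the target field
)

print(len(train_df), "training cases,", len(test_df), "testing cases")
```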
Some analysts use the testing dataset to refine the model parameters and leave a third part of the data, namely the validation dataset, for evaluation. However, the best approach, which unfortunately is not always employed, is to test the model's performance on a third, disjoint dataset drawn from a different time period.
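Continuing the sketch above, one way to carve out this third part is simply to split twice; the 60/20/20 proportions here are one illustrative choice, not a prescription from the text.

```python
from sklearn.model_selection import train_test_split

# Set aside the part reserved for evaluation first (the validation dataset,
# in this section's terminology), then split the remainder into the training
# dataset and the testing dataset used for parameter refinement.
rest, valid_df = train_test_split(history, test_size=0.20, random_state=42)
train_df, test_df = train_test_split(rest, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% is 20% overall, giving a 60/20/20 split.
```

A dataset from a different time period would instead be loaded separately and scored with the finished model, since random sampling from a single period cannot produce it.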
One of the most common performance indicators for classification models is the error rate: the percentage of misclassified records. The overall error rate indicates the percentage of all records that the model did not classify correctly. Since some mistakes may be more costly than others, this percentage is also estimated separately for each category of the target field. Error rates are summarized in misclassification matrices, also called coincidence or confusion matrices, which have the form given in Table 2.3.
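A minimal sketch of these computations, using made-up actual and predicted labels (with two deliberate misclassifications) rather than real campaign data:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted responses with two misclassifications.
actual    = ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes"]
predicted = ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"]

# Rows correspond to actual categories, columns to predicted ones.
cm = confusion_matrix(actual, predicted, labels=["No", "Yes"])

# Overall error rate: share of off-diagonal (misclassified) records.
overall_error = (cm.sum() - cm.diagonal().sum()) / cm.sum()   # 2/8 = 0.25

# Per-category error rate: misclassified share within each actual class.
per_class_error = 1 - cm.diagonal() / cm.sum(axis=1)          # [0.20, 0.33]
print(cm, overall_error, per_class_error)
```

The per-category rates make the asymmetry visible: here one of five actual "No" records and one of three actual "Yes" records are misclassified, which matters when the two kinds of mistakes carry different costs.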