3.3.4 Scenario 3: Incomplete Data
In this scenario the training data contains only partial information about the factors
that influence the class label. Often important characteristics are missing, e.g.,
for privacy reasons or because the data is hard to collect. In such situations a
classifier will use the remaining attributes and get the best accuracy out of them, often
overestimating the importance of the factors that are present in the dataset. Next
we discuss an example of such a situation.
Consider an insurance company that wants to determine the risk category of
new customers based upon their age, gender, car type, years of driving experience,
etc. An important factor that the insurance company cannot take into account,
however, is the driving style of the person. The reason for the absence of this
information is obvious: gathering it, e.g., by questioning his or her relatives,
following the person while he or she is driving, or getting information on the number
of fines the person had during the last few years, would not only be extremely
time-consuming, but would also invade that person's privacy. As a consequence,
the data is often incomplete and the classifier will have to base its decisions
on the other available attributes. In our example, the historical data shows
that, next to the horsepower of the car, the age and gender of a person
are highly correlated with the risk (the driving style is hidden from the company); see
Table 1.
Table 1 Example (fictitious) dataset on risk assessment for car insurances based on
demographic features. The attribute Driving style is hidden for the insurance company.

Customer no.  Gender  Age       Hp    Driving style  Risk
#1            Male    30 years  High  Aggressive     +
#2            Male    35 years  Low   Aggressive     -
#3            Female  24 years  Med.  Calm           -
#4            Female  18 years  Med.  Aggressive     +
#5            Male    65 years  High  Calm           -
#6            Male    54 years  Low   Aggressive     +
#7            Female  21 years  Low   Calm           -
#8            Female  29 years  Med.  Calm           -
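
The relationships in Table 1 can be checked with a short tabulation. The following is a minimal sketch in Python, not part of the original text: the tuple layout of the rows, the column index names and the helper share are illustrative assumptions. It computes, per group, the fraction of customers that have a given value in another column.

```python
from collections import defaultdict

# Rows of Table 1: (customer no., gender, age, hp, driving style, risk).
# The tuple layout is an assumption made for this sketch.
ROWS = [
    (1, "Male", 30, "High", "Aggressive", "+"),
    (2, "Male", 35, "Low", "Aggressive", "-"),
    (3, "Female", 24, "Med.", "Calm", "-"),
    (4, "Female", 18, "Med.", "Aggressive", "+"),
    (5, "Male", 65, "High", "Calm", "-"),
    (6, "Male", 54, "Low", "Aggressive", "+"),
    (7, "Female", 21, "Low", "Calm", "-"),
    (8, "Female", 29, "Med.", "Calm", "-"),
]

# Column positions within each row tuple.
GENDER, STYLE, RISK = 1, 4, 5


def share(rows, group_col, target_col, target_value):
    """Per group, the fraction of rows whose target_col equals target_value."""
    counts = defaultdict(lambda: [0, 0])  # group -> [matching rows, all rows]
    for row in rows:
        entry = counts[row[group_col]]
        entry[0] += row[target_col] == target_value
        entry[1] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}


# High-risk share per driving style (the attribute the insurer cannot see).
risk_by_style = share(ROWS, STYLE, RISK, "+")
# High-risk share per gender (an attribute the insurer can see).
risk_by_gender = share(ROWS, GENDER, RISK, "+")
# Share of aggressive drivers per gender.
aggressive_by_gender = share(ROWS, GENDER, STYLE, "Aggressive")

print(risk_by_style)         # {'Aggressive': 0.75, 'Calm': 0.0}
print(risk_by_gender)        # {'Male': 0.5, 'Female': 0.25}
print(aggressive_by_gender)  # {'Male': 0.75, 'Female': 0.25}
```

On this small fictitious sample, driving style separates the risk classes far better than gender does, while gender still carries some signal because most of the aggressive drivers in the table are male.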
From this dataset it is clear that the true decisive factor is the driving style of the
driver, rather than gender or age: all high-risk drivers have an aggressive driving
style and, conversely, only one aggressive driver (#2) is not high risk. There is thus
an almost perfect correlation between being an aggressive driver and presenting a
high accident risk in traffic. The driving style, however, is tightly connected to
gender and age. Because the driving style itself is not available, young male drivers
will, according to the insurance company, present a higher danger and hence receive
a higher premium. In such a situation we
say that the gender of a person is a so-called proxy for the difficult-to-observe
attribute driving style. In statistics, a proxy variable describes something that is
probably not in itself of any great interest, but from which a variable of interest can