transform such attributes into numerical attributes. This is necessary
for those algorithms, like k-means, that work only with numerical
data. As such, explosion is used on categorical data. Some category
sets have no order among their values (e.g., the attribute marital status
with values married, single, divorced, and widowed). These can be exploded
using what is called the indicator technique. For marital status, four
new attributes, named after the categories, replace the original
attribute. For each case, the new attribute corresponding to the
value in marital status is set to 1; all others are set to 0. If a case has the
value married, then the attribute named “married” is set to 1, and
the other three new attributes are set to 0.
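
The indicator technique can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from any data mining API: the class name IndicatorExplosion and the method explode are hypothetical, and the category list is assumed to be known in advance.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class IndicatorExplosion {

    // Returns one 0/1 flag per category, in the order of the category list.
    // Only the flag at the position of the given value is set to 1.
    static int[] explode(List<String> categories, String value) {
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        int[] flags = new int[categories.size()]; // all flags start at 0
        flags[index] = 1;
        return flags;
    }

    public static void main(String[] args) {
        List<String> maritalStatus =
                Arrays.asList("married", "single", "widowed", "divorced");
        // A case with the value "married" prints [1, 0, 0, 0]
        System.out.println(Arrays.toString(explode(maritalStatus, "married")));
    }
}
```

Each position in the returned array stands for one of the new attributes, so the array length always equals the number of categories.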
If an attribute contains values that are ordered, such as customer
satisfaction with values high, medium, and low, that attribute can be
exploded using a technique called thermometer. For customer
satisfaction, three new attributes are created with names of the
categories: high, medium, and low. These replace the original attribute.
For each case in the dataset, the new attribute corresponding to the
value in customer satisfaction is set to 1, as are the new
attributes ordered below it. The remaining new attributes are set to 0.
For example, if a case has the value medium, then the attributes
named “medium” and “low” are set to 1 and the “high” attribute is
set to 0. Both techniques are illustrated in Figure 3-5.
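
A comparable sketch for the thermometer technique follows. Again, the class name ThermometerExplosion and the method explode are hypothetical, and the sketch assumes the categories are supplied in order from highest to lowest, so that every category after a given value is ordered below it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class ThermometerExplosion {

    // Assumes categoriesHighToLow is ordered from highest to lowest.
    // Sets the flag for the given value and for every lower-ordered category.
    static int[] explode(List<String> categoriesHighToLow, String value) {
        int index = categoriesHighToLow.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        int[] flags = new int[categoriesHighToLow.size()]; // all flags start at 0
        for (int i = index; i < flags.length; i++) {
            flags[i] = 1; // the value itself and everything ordered below it
        }
        return flags;
    }

    public static void main(String[] args) {
        List<String> satisfaction = Arrays.asList("high", "medium", "low");
        // A case with the value "medium" prints [0, 1, 1]
        System.out.println(Arrays.toString(explode(satisfaction, "medium")));
    }
}
```

Unlike the indicator encoding, the number of flags set to 1 here grows with the rank of the value, which is what preserves the ordering in the numerical representation.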
Indicator

Marital Status   Married  Single  Widowed  Divorced
Married             1       0        0        0
Single              0       1        0        0
Widowed             0       0        1        0
Divorced            0       0        0        1

Thermometer

Customer Satisfaction   High  Medium  Low
High                     1      1      1
Medium                   0      1      1
Low                      0      0      1

Figure 3-5
Exploding attribute: indicator and thermometer approach.
 