Attribute subset selection 4 reduces the data set size by removing irrelevant or
redundant attributes (or dimensions). The goal of attribute subset selection is to find
a minimum set of attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: It reduces the number
of attributes appearing in the discovered patterns, helping to make the patterns easier to
understand.
“How can we find a 'good' subset of the original attributes?” For n attributes, there are 2^n possible subsets. An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n and the number of data classes increase. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute subset selection. These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. Their strategy is to make a locally optimal choice in the hope that this will lead to a globally optimal solution. Such greedy methods are effective in practice and may come close to estimating an optimal solution.
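To make the greedy strategy concrete, here is a minimal Python sketch of greedy forward selection (an illustration, not from the text). The score function is an assumed stand-in for whatever evaluation measure is chosen, with higher values taken to be better; it must accept any subset of attributes, including the empty set.

def forward_selection(attributes, score, max_size=None):
    """Greedy forward selection: start from the empty set and repeatedly
    add the single attribute whose inclusion most improves the score."""
    remaining = set(attributes)
    selected = set()
    best_score = score(selected)  # score() must handle the empty set
    while remaining and (max_size is None or len(selected) < max_size):
        # Locally optimal choice: the attribute with the best marginal gain.
        candidate = max(remaining, key=lambda a: score(selected | {a}))
        candidate_score = score(selected | {candidate})
        if candidate_score <= best_score:  # no attribute improves the score
            break
        selected.add(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

Each pass evaluates one candidate subset per remaining attribute, so the search examines on the order of n^2 subsets rather than all 2^n.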
The “best” (and “worst”) attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification. 5
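As an illustration of one such measure, the following Python sketch (the data layout and names are assumptions for the example, not from the text) computes the information gain of a categorical attribute as the reduction in class entropy after partitioning the records on that attribute.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a nonempty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy of the class labels minus the entropy remaining after
    partitioning the rows on the attribute at attr_index."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    conditional = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - conditional

# Attribute 0 perfectly separates the classes, so its gain equals the
# full class entropy (1.0 for a balanced two-class data set).
rows = [("sunny", "hot"), ("rain", "mild"), ("sunny", "mild"), ("rain", "hot")]
labels = ["no", "yes", "no", "yes"]
print(information_gain(rows, labels, 0))  # 1.0

A per-attribute measure like this is one way to rank the candidates considered at each greedy step.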
Basic heuristic methods of attribute subset selection include the techniques that
follow, some of which are illustrated in Figure 3.6.
Forward selection:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    Initial reduced set: {}
    => {A1} => {A1, A4} => Reduced attribute set: {A1, A4, A6}

Backward elimination:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    => {A1, A3, A4, A5, A6} => {A1, A4, A5, A6}
    => Reduced attribute set: {A1, A4, A6}

Decision tree induction:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    Induced tree: a test on A4 at the root, whose Y branch leads to a test
    on A1 and whose N branch leads to a test on A6; each test's Y branch
    ends in Class 1 and N branch in Class 2.
    => Reduced attribute set: {A1, A4, A6}

Figure 3.6 Greedy (heuristic) methods for attribute subset selection.
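Backward elimination, the second method in Figure 3.6, can be sketched in the same style as the forward-selection example above (again assuming a generic score function for which higher is better): start from the full attribute set and repeatedly drop the attribute whose removal leaves the best-scoring subset.

def backward_elimination(attributes, score, min_size=1):
    """Greedy backward elimination: start from the full attribute set and
    repeatedly remove the attribute whose removal scores best."""
    selected = set(attributes)
    best_score = score(selected)
    while len(selected) > min_size:
        # Try dropping each attribute; keep the removal that scores best.
        candidate = max(selected, key=lambda a: score(selected - {a}))
        candidate_score = score(selected - {candidate})
        if candidate_score < best_score:  # every removal hurts the score
            break
        selected.remove(candidate)
        best_score = candidate_score
    return selected

The non-strict acceptance test lets the method discard redundant attributes whose removal leaves the score unchanged.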
4 In machine learning, attribute subset selection is known as feature subset selection.
5 The information gain measure is described in detail in Chapter 8.