2.1 Introduction
The usual goal of feature selection in machine learning is to find the best set of
features that allows one to build useful models of the studied phenomena. This chapter
is devoted to a different application of the feature selection process, where building
a machine learning model is merely a tool for extracting all features that are relevant
for a problem. Relevance is considered here in a broad sense: it is sufficient for a
feature to be declared relevant when it is useful for building a machine learning
model of the problem under scrutiny in some context. One may ask why this goal is
relevant at all. Why should anyone be interested in this type of relevance?
Let us first describe a toy problem that illustrates the need for all-relevant
feature selection in an artificially transparent setting. Let us construct a system
containing 100 objects described with one hundred real-valued variables X1, ..., X100, and
one binary decision variable D. The descriptive variables X1 and X2 are drawn from
a normal distribution N(0, 1). The value of the decision variable is determined from the
values of these variables in the following manner: it is one (TRUE) if both variables
have the same sign and zero (FALSE) if their signs differ. The descriptive
variables X3, ..., X10 are obtained as linear combinations of X1 and X2, and
normalised to N(0, 1). The variables X11, ..., X100 are drawn from a normal
distribution N(0, 1). Finally, the indexes of the variables are randomly permuted. The goal
of the researcher is to determine which variables are responsible for the value of the
decision variable.
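
For concreteness, the construction can be reproduced along the following lines. This is a minimal sketch in Python with numpy; the random seed and the mixing weights for X3, ..., X10 are our own choices, as the chapter does not specify them.

import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of objects

# X1 and X2: independent N(0, 1) variables that determine the decision.
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

# D is one (TRUE) when X1 and X2 share a sign, zero (FALSE) otherwise.
d = (np.sign(x1) == np.sign(x2)).astype(int)

# X3, ..., X10: linear combinations of X1 and X2, normalised to N(0, 1).
# The mixing weights are arbitrary assumptions, not taken from the chapter.
combos = []
for _ in range(8):
    a, b = rng.standard_normal(2)
    c = a * x1 + b * x2
    combos.append((c - c.mean()) / c.std())

# X11, ..., X100: ninety N(0, 1) variables unrelated to the decision.
noise = rng.standard_normal((n, 90))

# Columns 0 and 1 hold X1 and X2; a random permutation hides their origin.
X_raw = np.column_stack([x1, x2] + combos + [noise])
X = X_raw[:, rng.permutation(X_raw.shape[1])]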
There is a very easy path to the solution of this problem: one could take a classifier
that is able to rank feature importance and select the two most important features.
Unfortunately, this path may lead us astray, as shown in Fig. 2.1, which displays the
ranking of feature importance for our toy problem returned by a random forest (RF)
[2] classifier. Here, for clarity, the variable indexes are not permuted.
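
A ranking of this kind can be produced, for instance, with scikit-learn's RandomForestClassifier applied to the data generated above; the number of trees and the reliance on the impurity-based importance measure are assumptions on our part, not details taken from [2] or Fig. 2.1.

from sklearn.ensemble import RandomForestClassifier

# Fit the forest on the unpermuted matrix so the ranking can be read off
# directly, as in Fig. 2.1.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_raw, d)

# Sort variables by the impurity-based importance measure, highest first.
order = rf.feature_importances_.argsort()[::-1]
for idx in order[:10]:
    print(f"X{idx + 1}: {rf.feature_importances_[idx]:.4f}")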
The toy problem is simple enough that it can be solved directly by a brute-force
approach. It is sufficient to build all 4,950 two-variable models (one for each of the
100 choose 2 = 100 · 99/2 pairs) to find the one that gives perfect classification and
hence is most likely built on the two variables used to generate the decision. However,
for real-life problems the number of descriptive variables may be much larger, the
connections between these variables and the decision may be more complicated, and the
measurements are subject to noise. Moreover, one does not know beforehand how many
variables influence the decision. Finally, while for our toy problem the model based
on the two generating variables usually gives the best results, this is not guaranteed
in the general case. Hence the brute-force approach will not work in most cases.
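
For the toy problem itself the exhaustive search is nevertheless easy to write down. The sketch below reuses X and d from the earlier example; the choice of a shallow decision tree scored by five-fold cross-validation is ours, and any classifier able to capture the sign pattern would do.

from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Score every one of the C(100, 2) = 4,950 two-variable models by
# cross-validated accuracy and report the best pair. On this toy problem
# the winning pair is usually the one built on the generating variables.
best_score, best_pair = 0.0, None
for i, j in combinations(range(X.shape[1]), 2):
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    score = cross_val_score(tree, X[:, [i, j]], d, cv=5).mean()
    if score > best_score:
        best_score, best_pair = score, (i, j)

print(f"best pair of column indexes: {best_pair}, accuracy: {best_score:.3f}")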
As an example of a real-life application we may consider deciphering the connection
between gene expression levels in humans and some medical condition. In this case the
number of variables is roughly twenty thousand, it is not known how many genes are
involved and how, and, last but not least, the measurements are subject both to normal
variability and to experimental error. The analysis of such a problem can be split into
two separate tasks: determining which variables are connected in some way with the
decision variable, and then identifying those variables that are directly responsible for
the value of the decision.