2.1 Introduction
The usual goal of feature selection in machine learning is to find the best set of
features that allows one to build useful models of the studied phenomena. This chapter
is devoted to a different application of the feature selection process, where building
a machine learning model is merely a tool for extracting all features that are relevant
for a problem. Relevance is considered here in a broad sense: it is sufficient for a
feature to be declared relevant when it is useful for building a machine learning
model of the problem under scrutiny in some context. One may ask why this goal is
relevant at all. Why should anyone be interested in this type of relevance?
Let us first describe a toy problem that illustrates the need for all-relevant
feature selection in an artificially transparent setting. Let us construct a system
containing 100 objects described with one hundred real-valued variables X1, ..., X100, and
one binary decision variable D. The descriptive variables X1 and X2 are drawn from
a normal distribution N(0, 1). The value of the decision variable is determined from the
values of these variables in the following manner: it is one (TRUE) if both variables
have the same sign and zero (FALSE) if their signs differ. The descriptive
variables X3, ..., X10 are obtained as linear combinations of X1 and X2, and
normalised to N(0, 1). The variables X11, ..., X100 are drawn from a normal
distribution N(0, 1). Finally, the indexes of the variables are randomly permuted. The goal
of the researcher is to determine which variables are responsible for the value of the
decision variable.
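
For concreteness, the construction can be reproduced along the following lines. This is a minimal sketch in Python with numpy; the random seed and the mixing weights for X3, ..., X10 are our own choices, as the chapter does not specify them.

import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of objects

# X1 and X2: independent N(0, 1) variables that determine the decision.
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)

# D is one (TRUE) when X1 and X2 share a sign, zero (FALSE) otherwise.
d = (np.sign(x1) == np.sign(x2)).astype(int)

# X3, ..., X10: linear combinations of X1 and X2, normalised to N(0, 1).
# The mixing weights are arbitrary assumptions, not taken from the chapter.
combos = []
for _ in range(8):
    a, b = rng.standard_normal(2)
    c = a * x1 + b * x2
    combos.append((c - c.mean()) / c.std())

# X11, ..., X100: ninety N(0, 1) variables unrelated to the decision.
noise = rng.standard_normal((n, 90))

# Columns 0 and 1 hold X1 and X2; a random permutation hides their origin.
X_raw = np.column_stack([x1, x2] + combos + [noise])
X = X_raw[:, rng.permutation(X_raw.shape[1])]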
There is a very easy path to the solution of this problem: one could take a classifier
that is able to rank feature importance and select the two most important features.
Unfortunately, this path may lead us astray, as shown in Fig. 2.1, which displays the
ranking of feature importance for our toy problem returned by a random forest (RF)
[2] classifier. Here, for clarity, the variable indexes are not permuted.
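
A ranking of this kind can be produced, for instance, with scikit-learn's RandomForestClassifier applied to the data generated above; the number of trees and the reliance on the impurity-based importance measure are assumptions on our part, not details taken from [2] or Fig. 2.1.

from sklearn.ensemble import RandomForestClassifier

# Fit the forest on the unpermuted matrix so the ranking can be read off
# directly, as in Fig. 2.1.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_raw, d)

# Sort variables by the impurity-based importance measure, highest first.
order = rf.feature_importances_.argsort()[::-1]
for idx in order[:10]:
    print(f"X{idx + 1}: {rf.feature_importances_[idx]:.4f}")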
The toy problem is simple enough that it can be solved directly by a brute-force
approach. It is sufficient to build all 4,950 two-variable models (one for each of the
100 choose 2 = 100 · 99/2 pairs) to find the one that gives perfect classification and
hence is most likely built on the two variables used to generate the decision. However,
for real-life problems the number of descriptive variables may be much larger, the
connections between these variables and the decision may be more complicated, and the
measurements are subject to noise. Moreover, one does not know beforehand how many
variables influence the decision. Finally, while for our toy problem the model based
on the two generating variables usually gives the best results, this is not guaranteed
in the general case. Hence the brute-force approach will not work in most cases.
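
For the toy problem itself the exhaustive search is nevertheless easy to write down. The sketch below reuses X and d from the earlier example; the choice of a shallow decision tree scored by five-fold cross-validation is ours, and any classifier able to capture the sign pattern would do.

from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Score every one of the C(100, 2) = 4,950 two-variable models by
# cross-validated accuracy and report the best pair. On this toy problem
# the winning pair is usually the one built on the generating variables.
best_score, best_pair = 0.0, None
for i, j in combinations(range(X.shape[1]), 2):
    tree = DecisionTreeClassifier(max_depth=4, random_state=0)
    score = cross_val_score(tree, X[:, [i, j]], d, cv=5).mean()
    if score > best_score:
        best_score, best_pair = score, (i, j)

print(f"best pair of column indexes: {best_pair}, accuracy: {best_score:.3f}")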
As an example of a real-life application we may consider deciphering the connection
between gene expression levels in humans and some medical condition. In this case the
number of variables is roughly twenty thousand, it is not known how many genes are
involved and how, and, last but not least, the measurements are subject both to normal
variability and to experimental error. The analysis of such a problem can be split into
two separate tasks: determining which variables are connected in some way with the
decision variable, and then identifying those variables that are directly responsible for
the value of the decision.