10.3 Inspiration from Feature Selection
Feature selection is a key procedure for reducing the dimensionality of the input space in order to save computational cost. It has become a default preprocessing step for many learning algorithms, such as Artificial Neural Networks, k-Nearest Neighbors, and Decision Trees. The ML research community is well aware of the tradeoff between the computational constraints imposed by a high-dimensional input space and the richness of information that space provides for identifying each individual object. IR and ML research has shown that feature selection can capture the salient information by selecting the most important attributes, thereby making the computing tasks tractable [14, 32, 37, 47]. Furthermore, feature selection is also beneficial because it tends to reduce over-fitting, in which the trained models fit very well the data upon which they were built but perform poorly when applied to unseen data [40].
In TC, several feature selection methods have been intensively studied to distill the important terms while keeping the dimensionality low. Table 10.1 shows the general functions of several popular feature selection methods. These methods derive either from information theory or from the linear algebra literature [40, 47].
Table 10.1. Several feature selection methods and their functions, where $t_k$ denotes a term, $c_i$ stands for a category, $P(t_k, c_i)$ denotes the probability a document is from category $c_i$ when term $t_k$ occurs at least once in it, $P(t_k, \bar{c}_i)$ denotes the probability a document is not from category $c_i$ when term $t_k$ occurs at least once in it, $P(\bar{t}_k, c_i)$ denotes the probability a document is from category $c_i$ when term $t_k$ does not occur in it, and $P(\bar{t}_k, \bar{c}_i)$ denotes the probability a document is not from category $c_i$ when term $t_k$ does not occur in it.
Feature Selection Method    Mathematical Form
Information Gain            $P(t_k, c_i)\log\frac{P(t_k, c_i)}{P(t_k)\,P(c_i)} + P(\bar{t}_k, c_i)\log\frac{P(\bar{t}_k, c_i)}{P(\bar{t}_k)\,P(c_i)}$
Mutual Information          $\log\frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}$
Chi-square                  $\frac{N\,[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)]^2}{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}$
Correlation Coefficient     $\frac{\sqrt{N}\,[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)]}{\sqrt{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}}$
Odds Ratio                  $\log\frac{P(t_k \mid c_i)\,(1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i))\,P(t_k \mid \bar{c}_i)}$
Simplified Chi-square       $P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)$
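To make these scoring functions concrete, the following Python sketch estimates the probabilities of Table 10.1 as empirical document fractions over a toy collection given as (set of terms, category label) pairs. The function name feature_scores and the example documents are illustrative assumptions, not part of the original text.

from math import log

def feature_scores(docs, term, category):
    # docs: list of (set_of_terms, label) pairs; all probabilities are
    # estimated as fractions of documents, as in Table 10.1.
    N = len(docs)
    n_tc   = sum(1 for terms, c in docs if term in terms and c == category)
    n_tnc  = sum(1 for terms, c in docs if term in terms and c != category)
    n_ntc  = sum(1 for terms, c in docs if term not in terms and c == category)
    n_ntnc = N - n_tc - n_tnc - n_ntc

    p_tc, p_tnc   = n_tc / N, n_tnc / N       # P(t_k, c_i),     P(t_k, not c_i)
    p_ntc, p_ntnc = n_ntc / N, n_ntnc / N     # P(not t_k, c_i), P(not t_k, not c_i)
    p_t, p_nt = p_tc + p_tnc, p_ntc + p_ntnc  # P(t_k),   P(not t_k)
    p_c, p_nc = p_tc + p_ntc, p_tnc + p_ntnc  # P(c_i),   P(not c_i)

    # Mutual Information: log P(t_k, c_i) / (P(t_k) P(c_i))
    mi = log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")
    # Chi-square: N [P(t,c)P(!t,!c) - P(t,!c)P(!t,c)]^2 / (P(t)P(!t)P(c)P(!c))
    den = p_t * p_nt * p_c * p_nc
    chi2 = N * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / den if den > 0 else 0.0
    # Simplified Chi-square: P(t,c)P(!t,!c) - P(t,!c)P(!t,c)
    s_chi2 = p_tc * p_ntnc - p_tnc * p_ntc
    return {"MI": mi, "Chi-square": chi2, "Simplified Chi-square": s_chi2}

# Toy usage with two hypothetical categories.
docs = [({"goal", "match"}, "sports"), ({"vote", "law"}, "politics"),
        ({"goal", "team"}, "sports"), ({"law", "court"}, "politics")]
print(feature_scores(docs, "goal", "sports"))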
Basically, there are two distinct ways to rank and assess the features, i.e., globally
and locally.
a) Global feature selection aims to select features which are good across all categories. In this manner, either the sum $f_{sum}(t_k) = \sum_{i=1}^{|c|} f(t_k, c_i)$ or the weighted average $f_{avg}(t_k) = \sum_{i=1}^{|c|} P(c_i)\, f(t_k, c_i)$ of a term $t_k$ is assessed, where $f$ is the specified feature selection method, $P(c_i)$ is the percentage of category $c_i$, and $|c|$ denotes the number of categories.
b) Local feature selection aims to differentiate those terms that are more distinguishable for certain categories only. Usually, terms are ranked and selected separately for each category, so that each category obtains its own feature subset; both the global and the local strategy are sketched in the example below.
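As a minimal sketch of the two ranking strategies, the code below assumes a scoring function score(term, category) that implements one of the methods in Table 10.1 and a dictionary category_prior playing the role of P(c_i); all names are illustrative.

def global_selection(score, terms, categories, category_prior, k, mode="sum"):
    # Rank terms by a score aggregated over all categories, then keep the top k.
    def aggregate(t):
        if mode == "sum":                           # f_sum(t_k)
            return sum(score(t, c) for c in categories)
        return sum(category_prior[c] * score(t, c)  # f_avg(t_k)
                   for c in categories)
    return sorted(terms, key=aggregate, reverse=True)[:k]

def local_selection(score, terms, categories, k):
    # Rank terms separately per category: each category keeps its own top k.
    return {c: sorted(terms, key=lambda t, c=c: score(t, c), reverse=True)[:k]
            for c in categories}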
 