10.3 Inspiration from Feature Selection
Feature selection is a key procedure for reducing the dimensionality of the input space in order to save computational cost. It has become a default preprocessing step for many learning algorithms, such as Artificial Neural Networks, k-Nearest Neighbors, and Decision Trees. The ML research community is well aware of the tradeoff between the computational constraints imposed by a high-dimensional input space and the richness of information that space provides for identifying each individual object. IR and ML research has shown that feature selection can capture the salient information by selecting the most important attributes, thereby making the computing tasks tractable [14, 32, 37, 47]. Furthermore, feature selection is also beneficial because it tends to reduce over-fitting, in which the trained models fit very well the data upon which they were built but perform poorly when applied to unseen data [40].
In TC, several feature selection methods have been intensively studied to distill the important terms while keeping the dimensionality low. Table 10.1 shows the general functions of several popular feature selection methods. These methods derive either from information theory or from the linear algebra literature [40, 47].
Table 10.1. Several feature selection methods and their functions, where $t_k$ denotes a term, $c_i$ stands for a category, $P(t_k, c_i)$ denotes the probability a document is from category $c_i$ when term $t_k$ occurs at least once in it, $P(t_k, \bar{c}_i)$ denotes the probability a document is not from category $c_i$ when term $t_k$ occurs at least once in it, $P(\bar{t}_k, c_i)$ denotes the probability a document is from category $c_i$ when term $t_k$ does not occur in it, and $P(\bar{t}_k, \bar{c}_i)$ denotes the probability a document is not from category $c_i$ when term $t_k$ does not occur in it.
Feature Selection Method    Mathematical Form
Information Gain            $P(t_k, c_i)\log\frac{P(t_k, c_i)}{P(t_k)\,P(c_i)} + P(\bar{t}_k, c_i)\log\frac{P(\bar{t}_k, c_i)}{P(\bar{t}_k)\,P(c_i)}$
Mutual Information          $\log\frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}$
Chi-square                  $\frac{N\,[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)]^2}{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}$
Correlation Coefficient     $\frac{\sqrt{N}\,[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)]}{\sqrt{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}}$
Odds Ratio                  $\log\frac{P(t_k \mid c_i)\,(1 - P(t_k \mid \bar{c}_i))}{(1 - P(t_k \mid c_i))\,P(t_k \mid \bar{c}_i)}$
Simplified Chi-square       $P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)$
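To make these scoring functions concrete, the following Python sketch estimates the probabilities of Table 10.1 as empirical document fractions over a toy collection given as (set of terms, category label) pairs. The function name feature_scores and the example documents are illustrative assumptions, not part of the original text.

from math import log

def feature_scores(docs, term, category):
    # docs: list of (set_of_terms, label) pairs; all probabilities are
    # estimated as fractions of documents, as in Table 10.1.
    N = len(docs)
    n_tc   = sum(1 for terms, c in docs if term in terms and c == category)
    n_tnc  = sum(1 for terms, c in docs if term in terms and c != category)
    n_ntc  = sum(1 for terms, c in docs if term not in terms and c == category)
    n_ntnc = N - n_tc - n_tnc - n_ntc

    p_tc, p_tnc   = n_tc / N, n_tnc / N       # P(t_k, c_i),     P(t_k, not c_i)
    p_ntc, p_ntnc = n_ntc / N, n_ntnc / N     # P(not t_k, c_i), P(not t_k, not c_i)
    p_t, p_nt = p_tc + p_tnc, p_ntc + p_ntnc  # P(t_k),   P(not t_k)
    p_c, p_nc = p_tc + p_ntc, p_tnc + p_ntnc  # P(c_i),   P(not c_i)

    # Mutual Information: log P(t_k, c_i) / (P(t_k) P(c_i))
    mi = log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")
    # Chi-square: N [P(t,c)P(!t,!c) - P(t,!c)P(!t,c)]^2 / (P(t)P(!t)P(c)P(!c))
    den = p_t * p_nt * p_c * p_nc
    chi2 = N * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / den if den > 0 else 0.0
    # Simplified Chi-square: P(t,c)P(!t,!c) - P(t,!c)P(!t,c)
    s_chi2 = p_tc * p_ntnc - p_tnc * p_ntc
    return {"MI": mi, "Chi-square": chi2, "Simplified Chi-square": s_chi2}

# Toy usage with two hypothetical categories.
docs = [({"goal", "match"}, "sports"), ({"vote", "law"}, "politics"),
        ({"goal", "team"}, "sports"), ({"law", "court"}, "politics")]
print(feature_scores(docs, "goal", "sports"))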
Basically, there are two distinct ways to rank and assess the features, i.e., globally
and locally.
a) Global feature selection aims to select features which are good across all categories. In this manner, either the sum $f_{sum}(t_k) = \sum_{i=1}^{|c|} f(t_k, c_i)$ or the weighted average $f_{avg}(t_k) = \sum_{i=1}^{|c|} P(c_i)\, f(t_k, c_i)$ of a term $t_k$ is assessed, where $f$ is the specified feature selection method, $P(c_i)$ is the percentage of category $c_i$, and $|c|$ denotes the number of categories.
b) Local feature selection aims to differentiate those terms that are more distinguishable for certain categories only. Usually, terms are ranked and selected separately for each category, so that each category obtains its own feature subset; both the global and the local strategy are sketched in the example below.
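As a minimal sketch of the two ranking strategies, the code below assumes a scoring function score(term, category) that implements one of the methods in Table 10.1 and a dictionary category_prior playing the role of P(c_i); all names are illustrative.

def global_selection(score, terms, categories, category_prior, k, mode="sum"):
    # Rank terms by a score aggregated over all categories, then keep the top k.
    def aggregate(t):
        if mode == "sum":                           # f_sum(t_k)
            return sum(score(t, c) for c in categories)
        return sum(category_prior[c] * score(t, c)  # f_avg(t_k)
                   for c in categories)
    return sorted(terms, key=aggregate, reverse=True)[:k]

def local_selection(score, terms, categories, k):
    # Rank terms separately per category: each category keeps its own top k.
    return {c: sorted(terms, key=lambda t, c=c: score(t, c), reverse=True)[:k]
            for c in categories}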
 