Handling of Imbalanced Data in Text Classification: Category-Based Term Weights - Natural Language Processing and Text Mining

based on $f_{\max}(t_k) = \max_{i=1}^{|c|} f(t_k, c_i)$, i.e., the maximum of the category-specific values.
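As a minimal sketch of this maximum over categories, assume the category-specific scores f(t_k, c_i) of one term are already available in a dictionary. The function name `f_max`, the category names, and the score values below are hypothetical and serve only as illustration.

```python
def f_max(scores_by_category):
    """Return the largest category-specific score f(t_k, c_i) of one term.

    scores_by_category maps a category label c_i to the score of the term
    for that category (hypothetical input format).
    """
    return max(scores_by_category.values())


# Hypothetical scores of a single term t_k over three categories.
scores = {"earn": 0.42, "acq": 0.17, "crude": 0.03}
global_score = f_max(scores)
```

Here `global_score` is the single value that would represent the term when one score per term is needed.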

Whether a method is applied in a 'global' or a 'local' sense has little impact on the
choice of the method itself, but it does affect the performance of classifiers built for
different categories. In TC, the main task is to decide whether a document belongs to
a specific category. We therefore prefer salient features that distinguish one category
from another, i.e., a 'local' approach. Ideally, the salient feature set of one category
shares no items with those of other categories. When overlap cannot be avoided, the
question becomes how best to represent the shared features.

While many previous works have shown the relative strengths and merits of these
methods [14, 32, 37, 40, 47], our experience with feature selection over a number
of standard and ad hoc data sets shows that the performance of such methods can be
highly dependent on the data. This is partly due to the lack of a quantitative
understanding of different data sets, and it needs further research. From our
previous study of these feature selection methods and what has been reported in the
literature [47], we noted that when these methods are applied to term selection in
text classification, they basically rely on the four fundamental information
elements shown in Table 10.2: A denotes the number of documents belonging
to category $c_i$ in which the term $t_k$ occurs at least once; B denotes the number of
documents not belonging to category $c_i$ in which the term $t_k$ occurs at least once; C
denotes the number of documents belonging to category $c_i$ in which the term $t_k$ does
not occur; D denotes the number of documents not belonging to category $c_i$ in which
the term $t_k$ does not occur.

Table 10.2. Fundamental information elements used for feature selection in text classification

                 $c_i$    $\bar{c}_i$
  $t_k$            A          B
  $\bar{t}_k$      C          D
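The four elements of Table 10.2 can be counted directly from a labelled collection. The following is a minimal sketch, assuming each document is given as a pair (set of terms, set of category labels); this input format, the function name `contingency_counts`, and the toy documents are assumptions made for illustration.

```python
def contingency_counts(docs, term, category):
    """Count A, B, C, D for one (term, category) pair.

    docs is an iterable of (terms, labels) pairs, where terms is the set of
    terms occurring in the document and labels is its set of categories.
    """
    a = b = c = d = 0
    for terms, labels in docs:
        has_term = term in terms
        in_category = category in labels
        if has_term and in_category:
            a += 1  # term occurs, document belongs to the category
        elif has_term:
            b += 1  # term occurs, document does not belong to the category
        elif in_category:
            c += 1  # term absent, document belongs to the category
        else:
            d += 1  # term absent, document does not belong to the category
    return a, b, c, d


# Hypothetical toy collection of five documents.
docs = [
    ({"profit", "share"}, {"earn"}),
    ({"profit", "oil"}, {"crude"}),
    ({"dividend"}, {"earn"}),
    ({"oil", "barrel"}, {"crude"}),
    ({"profit"}, {"earn"}),
]
A, B, C, D = contingency_counts(docs, "profit", "earn")
```

The four counts always sum to the number of documents in the collection.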

These four information elements have been used to estimate the probabilities
listed in Table 10.1. Table 10.3 shows the functions in Table 10.1 expressed in
terms of the information elements A, B, C and D, where N = A + B + C + D is the
total number of documents.

Table 10.3. Feature selection methods and their formulations as represented by information elements in Table 10.2

  Method                    Mathematical Form Represented by Information Elements

  Information Gain          $-\frac{A+C}{N}\log\frac{A+C}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{C}{N}\log\frac{C}{C+D}$
  Mutual Information        $\log\frac{AN}{(A+B)(A+C)}$
  Chi-square                $\frac{N(AD-BC)^2}{(A+C)(B+D)(A+B)(C+D)}$
  Correlation Coefficient   $\frac{\sqrt{N}\,(AD-BC)}{\sqrt{(A+C)(B+D)(A+B)(C+D)}}$
  Odds Ratio                $\log\frac{AD}{BC}$
  Simplified Chi-square     $\frac{AD-BC}{N^2}$
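To make the forms in Table 10.3 concrete, the sketch below evaluates each of them from the four counts. It assumes natural logarithms, N = A + B + C + D, and counts large enough that no logarithm argument or denominator is zero (no smoothing is applied). The correlation coefficient is taken here as the signed square root of chi-square, and the function names are illustrative only.

```python
import math


def information_gain(a, b, c, d):
    n = a + b + c + d
    return (-(a + c) / n * math.log((a + c) / n)
            + a / n * math.log(a / (a + b))
            + c / n * math.log(c / (c + d)))


def mutual_information(a, b, c, d):
    n = a + b + c + d
    return math.log(a * n / ((a + b) * (a + c)))


def chi_square(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


def correlation_coefficient(a, b, c, d):
    # Signed square root of chi-square (assumed form, see lead-in).
    n = a + b + c + d
    denom = math.sqrt((a + c) * (b + d) * (a + b) * (c + d))
    return math.sqrt(n) * (a * d - b * c) / denom


def odds_ratio(a, b, c, d):
    return math.log(a * d / (b * c))


def simplified_chi_square(a, b, c, d):
    n = a + b + c + d
    return (a * d - b * c) / n ** 2


# Hypothetical counts for one (term, category) pair.
counts = (2, 1, 1, 1)
scores = {
    "information_gain": information_gain(*counts),
    "mutual_information": mutual_information(*counts),
    "chi_square": chi_square(*counts),
    "correlation_coefficient": correlation_coefficient(*counts),
    "odds_ratio": odds_ratio(*counts),
    "simplified_chi_square": simplified_chi_square(*counts),
}
```

A practical implementation would add smoothing or explicit guards for zero counts, which occur frequently for rare terms and small categories.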
