c_i. The higher this ratio, the more closely the term t_k is related to category c_i. A similar analysis can be made with respect to A/C: that ratio reflects the expectation that a term is deemed more relevant if it occurs in a larger portion of the documents of category c_i than other terms do.
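To make the two ratios concrete, the sketch below computes them from a toy corpus, assuming the usual contingency-table convention for the information elements of Table 10.2: for a term t_k and category c_i, A is the number of documents in c_i containing t_k, B the number outside c_i containing t_k, and C the number in c_i lacking t_k. The documents and category names are invented for illustration.

```python
# Sketch of the A/B and A/C ratios, assuming the contingency-table
# convention: A = docs in c_i with t_k, B = docs outside c_i with t_k,
# C = docs in c_i without t_k. The toy corpus below is illustrative only.

def contingency(docs, labels, term, category):
    A = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    B = sum(1 for d, y in zip(docs, labels) if y != category and term in d)
    C = sum(1 for d, y in zip(docs, labels) if y == category and term not in d)
    return A, B, C

docs = [
    {"laser", "cutting", "steel"},   # manufacturing
    {"laser", "welding"},            # manufacturing
    {"milling", "steel"},            # manufacturing
    {"market", "sales"},             # business
    {"laser", "pointer", "sales"},   # business
]
labels = ["manu", "manu", "manu", "biz", "biz"]

A, B, C = contingency(docs, labels, "laser", "manu")
ratio_ab = A / B if B else float("inf")  # concentration of t_k in c_i vs. other categories
ratio_ac = A / C if C else float("inf")  # coverage of t_k within c_i
```

Here "laser" occurs in two of the three manufacturing documents and one business document, so both ratios equal 2.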
Table 10.4. The different combinations of Category-Based Term Weights (CBTW); the mathematical forms are expressed in terms of the information elements shown in Table 10.2

CBTW 1: log(1 + (A/B)(A/C))           CBTW 5: log(1 + ((A+B)/B)((A+C)/C))
CBTW 2: log(1 + A/(B+C))              CBTW 6: log(1 + (A+B)/B + (A+C)/C)
CBTW 3: log(1 + A/B) log(1 + A/C)     CBTW 7: log(1 + (A+B)/B) log(1 + (A+C)/C)
CBTW 4: log[(1 + A/B)(1 + A/C)]       CBTW 8: log[(1 + (A+B)/B)(1 + (A+C)/C)]
Since the computation of both A/B and A/C is intrinsically connected with category membership, we propose a new term weighting scheme, called Category-Based Term Weights (CBTW), to replace the idf part of the classic tfidf weighting scheme; feature selection, a regular step in TC, is therefore skipped in our experiments. Considering the probability foundation of A/B and A/C and the possibility of combining them, the most immediate choice is to take the product of the two ratios; the resulting schemes are named CBTWn in Table 10.4, where n = 1, 3, 5, 7. We also include four further variants, named CBTWn with n = 2, 4, 6, 8.
Because it is not clear in advance which combination delivers the best performance, all eight are evaluated in the benchmarking experiments reported in Section 10.6.
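As a minimal illustration of the most immediate combination, the product of the two ratios (CBTW 1), the sketch below uses it in place of the idf factor of tfidf. The add-one smoothing of B and C is our own assumption to avoid division by zero, not part of the scheme itself.

```python
import math

def cbtw1(A, B, C, smooth=1.0):
    # CBTW 1: log(1 + A/B * A/C), the product of the two category-based
    # ratios. The additive `smooth` guards against B = 0 or C = 0; this
    # smoothing is an assumption of the sketch, not of the original scheme.
    return math.log(1.0 + (A / (B + smooth)) * (A / (C + smooth)))

def term_weight(tf, A, B, C):
    # tf * CBTW1 replaces the classic tf * idf term weight.
    return tf * cbtw1(A, B, C)

# Example: a term occurring 3 times in a document, with A = 20, B = 2, C = 5.
w = term_weight(3, 20, 2, 5)
```

As expected, the weight grows as the term becomes more concentrated in, and more typical of, the target category (larger A relative to B and C).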
10.5 Experimental Setup
Two data sets were tested in our experiments: MCV1 and Reuters-21578. MCV1 is an archive of 1434 English-language manufacturing-related engineering papers gathered by courtesy of the Society of Manufacturing Engineers (SME). It comprises all engineering technical papers published by SME from 1998 to 2000. All documents were manually classified [27, 28]. There are a total of 18 major categories in MCV1. Figure 10.1 gives the class distribution in MCV1. Reuters-21578 is a widely used benchmarking collection [40]. We followed Sun's approach [41] in generating the category information. Figure 10.2 gives the class distribution of the Reuters dataset used in our experiment. Unlike Sun [41], we did not randomly sample negative examples from categories not belonging to any of the categories in our dataset; instead, we treated every example not from the target category as a negative.
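The negative-example construction described above, where every document whose label differs from the target category counts as a negative, amounts to a standard one-vs-rest labeling. A minimal sketch, with hypothetical category names:

```python
def one_vs_rest_labels(labels, target):
    # Positives are the documents of the target category; every other
    # document in the dataset is a negative -- no sampling from outside
    # categories is involved.
    return [1 if y == target else -1 for y in labels]

# Hypothetical category names for illustration.
labels = ["welding", "casting", "welding", "machining"]
y = one_vs_rest_labels(labels, "welding")  # [1, -1, 1, -1]
```

One such binary labeling is built per category, and a separate classifier is trained on each.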
We compared our weighting schemes experimentally with a number of other well-established weighting schemes, e.g., TFIDF, 'ltc,' and normalized 'ltc,' on MCV1 and Reuters-21578, using SVM as the classification algorithm. We also carried out benchmarking experiments between our schemes and many other feature selection methods, e.g., chi-square (ChiS), correlation coefficient (CC), odds ratio (OddsR),