information gain (IG) and relevance frequency (RF) [30], by replacing the idf term with the feature-selection value in the classic tfidf weighting scheme. These schemes are therefore largely formulated in the form tf · (feature value) (TFFV). Table 10.5 shows all 16 weighting schemes tested in our experiments and their mathematical formulations. Please note that the majority of TFFV schemes are composed of two factors: the normalized term frequency tf(t_i, d_j)/max[tf(d_j)], and the term's feature value, e.g., √N (AD − BC) / √((A+C)(B+D)(A+B)(C+D)) in the correlation coefficient scheme. Here tf(t_i, d_j) is the frequency of term t_i in document d_j, and max[tf(d_j)] is the maximum frequency of a term in document d_j. The only different ones are TFIDF weighting, the 'ltc' form, and the normalized 'ltc' form, as specified in Table 10.5.
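For concreteness, the two-factor TFFV composition can be sketched in Python. The function names and the toy contingency counts (A, B, C, D) below are our own illustration, not code from the chapter:

```python
import math

def ntf(tf_t_d, max_tf_d):
    # Normalized term frequency: tf(t_i, d_j) / max[tf(d_j)]
    return tf_t_d / max_tf_d

def correlation_coefficient(A, B, C, D):
    # Feature value of the correlation coefficient scheme:
    # sqrt(N) * (AD - BC) / sqrt((A+C)(B+D)(A+B)(C+D)), with N = A+B+C+D
    N = A + B + C + D
    denom = math.sqrt((A + C) * (B + D) * (A + B) * (C + D))
    return math.sqrt(N) * (A * D - B * C) / denom

# A TFFV weight is the product of the two factors (hypothetical counts):
weight = ntf(3, 7) * correlation_coefficient(A=40, B=10, C=20, D=30)
```

Any other feature-selection value can be swapped in for the second factor in the same way.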
Table 10.5. All weighting schemes tested in the experiments and their mathematical formulations, where the normalized term frequency ntf is defined as tf(t_i, d_j)/max[tf(d_j)], A, B, C, D are the cells of the term-category contingency table, N = A + B + C + D, and N(t_i) is the number of documents containing term t_i.

Weighting Scheme Name            Mathematical Formulation
Correlation Coef. (CC)           ntf · √N (AD − BC) / √((A+C)(B+D)(A+B)(C+D))
Chi-square (ChiS)                ntf · N (AD − BC)² / ((A+C)(B+D)(A+B)(C+D))
Information Gain (IG)            ntf · [(A/N) log(AN/((A+B)(A+C))) + (C/N) log(CN/((C+D)(A+C)))]
Odds Ratio (OddsR)               ntf · log(AD/BC)
Relevance Freq. (RF)             ntf · log(1 + (A+B)/B)
TFIDF                            ntf · log(N/N(t_i))
ltc                              tf(t_i, d_j) · log(N/N(t_i))
Normalized ltc (nltc)            tfidf_ltc(t_i, d_j) / √(Σ_k [tfidf_ltc(t_k, d_j)]²)
Cat. Based Term Wt. 1 (CBTW1)    ntf · log(1 + (A/B)(A/C))
Cat. Based Term Wt. 2 (CBTW2)    ntf · log(1 + A/B + A/C)
Cat. Based Term Wt. 3 (CBTW3)    ntf · log(1 + A/B) · log(1 + A/C)
Cat. Based Term Wt. 4 (CBTW4)    ntf · log[(1 + A/B)(1 + A/C)]
Cat. Based Term Wt. 5 (CBTW5)    ntf · log(1 + ((A+B)/B)((A+C)/C))
Cat. Based Term Wt. 6 (CBTW6)    ntf · log(1 + (A+B)/B + (A+C)/C)
Cat. Based Term Wt. 7 (CBTW7)    ntf · log(1 + (A+B)/B) · log(1 + (A+C)/C)
Cat. Based Term Wt. 8 (CBTW8)    ntf · log[(1 + (A+B)/B)(1 + (A+C)/C)]
Major standard text preprocessing steps were applied in our experiments, including stopword and punctuation removal, and stemming. However, feature selection was skipped, and all terms left after stopword and punctuation removal were kept as features. In our experiments we used the SVM implementation called SVM-Light [19, 20]. We used the linear function as its kernel, since previous work has shown that the linear kernel can deliver even better performance without tedious parameter tuning in TC [19, 21]. As for performance measurement, precision, recall, and the harmonic combination of precision and recall, i.e., the F1 value, were calculated [1, 36]. Performance was assessed based on five-fold cross-validation. Since