information gain (IG) and relevance frequency (RF) [30], by replacing the idf term with the feature selection value in the classic tf·idf weighting schemes. These schemes are therefore largely formulated in the form $tf \cdot (\text{feature value})$ (TFFV). Table 10.5 shows all 16 weighting schemes tested in our experiments and their mathematical formulations. Please note that the majority of the TFFV schemes are composed of two factors, i.e., the normalized term frequency $tf(t_i, d_j)/\max[tf(d_j)]$ and the term's feature value, e.g., $\frac{\sqrt{N}\,(AD-BC)}{\sqrt{(A+C)(B+D)(A+B)(C+D)}}$ in the correlation coefficient scheme, where $tf(t_i, d_j)$ is the frequency of term $t_i$ in the document $d_j$ and $\max[tf(d_j)]$ is the maximum frequency of a term in the document $d_j$. The only different ones are the TFIDF weighting, the 'ltc' form, and the normalized 'ltc' form, as specified in Table 10.5.
Table 10.5. All weighting schemes tested in the experiments and their mathematical formulations, where the normalized term frequency $ntf$ is defined as $ntf = tf(t_i, d_j)/\max[tf(d_j)]$

Weighting Scheme | Name | Mathematical Formulation
tf · Correlation Coef. | CC | $ntf \cdot \frac{\sqrt{N}\,(AD-BC)}{\sqrt{(A+C)(B+D)(A+B)(C+D)}}$
tf · Chi-square | ChiS | $ntf \cdot \frac{N(AD-BC)^2}{(A+C)(B+D)(A+B)(C+D)}$
tf · Information Gain | IG | $ntf \cdot \left(\frac{A}{N}\log\frac{AN}{(A+B)(A+C)} + \frac{C}{N}\log\frac{CN}{(C+D)(A+C)}\right)$
tf · Odds Ratio | OddsR | $ntf \cdot \log(AD/BC)$
tf · Relevance Freq. | RF | $ntf \cdot \log\left(1+\frac{A+B}{B}\right)$
TFIDF [1] | TFIDF | $ntf \cdot \log\frac{N}{N(t_i)}$
tfidf 'ltc' | ltc | $tf(t_i, d_j) \cdot \log\frac{N}{N(t_i)}$
Normalized 'ltc' | nltc | $tfidf_{ltc}\Big/\sqrt{\sum_{t_k \in d_j} tfidf_{ltc}(t_k, d_j)^2}$
Cat. Based Term Wt. 1 | CBTW1 | $ntf \cdot \log\left(1+\frac{B}{A} \cdot C\right)$
Cat. Based Term Wt. 2 | CBTW2 | $ntf \cdot \log(1+B+C)$
Cat. Based Term Wt. 3 | CBTW3 | $ntf \cdot \log(1+B) \cdot \log(1+C)$
Cat. Based Term Wt. 4 | CBTW4 | $ntf \cdot \log[(1+B)(1+C)]$
Cat. Based Term Wt. 5 | CBTW5 | $ntf \cdot \log\left(1+\frac{A+B}{B} \cdot \frac{A+C}{C}\right)$
Cat. Based Term Wt. 6 | CBTW6 | $ntf \cdot \log\left(1+\frac{A+B}{B} + \frac{A+C}{C}\right)$
Cat. Based Term Wt. 7 | CBTW7 | $ntf \cdot \log\left(1+\frac{A+B}{B}\right) \cdot \log\left(1+\frac{A+C}{C}\right)$
Cat. Based Term Wt. 8 | CBTW8 | $ntf \cdot \log\left[\left(1+\frac{A+B}{B}\right)\left(1+\frac{A+C}{C}\right)\right]$
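To make the TFFV construction concrete, the following sketch computes two of the feature values from Table 10.5 and combines them with the normalized term frequency. It is a minimal illustration, not the chapter's code: the function names and example counts are invented, and we assume A, B, C, and D are the usual term-category contingency counts (A: category documents containing the term; B: non-category documents containing the term; C: category documents without the term; D: non-category documents without the term), which is consistent with the formulas above.

    import math

    def ntf(tf_term_doc, max_tf_doc):
        # Normalized term frequency: tf(t_i, d_j) / max[tf(d_j)]
        return tf_term_doc / max_tf_doc

    def chi_square(A, B, C, D):
        # Chi-square feature value: N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D))
        N = A + B + C + D
        return N * (A * D - B * C) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

    def relevance_frequency(A, B):
        # Relevance frequency feature value as given in Table 10.5: log(1 + (A+B)/B)
        return math.log(1 + (A + B) / B)

    # A TFFV weight is simply ntf times the term's feature value.
    # Hypothetical example: the term occurs 3 times in a document whose most
    # frequent term occurs 10 times; contingency counts A=40, B=5, C=10, D=945.
    w_chis = ntf(3, 10) * chi_square(40, 5, 10, 945)
    w_rf = ntf(3, 10) * relevance_frequency(40, 5)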
Major standard text preprocessing steps were applied in our experiments, including stopword and punctuation removal, and stemming. However, feature selection was skipped, and all terms left after stopword and punctuation removal were kept as features. In our experiments we used the SVM implementation called SVMlight [19, 20]. We used the linear function as its kernel function, since previous work has shown that the linear function can deliver even better performance in TC without tedious parameter tuning [19, 21]. As for the performance measurement, precision, recall, and the harmonic combination of precision and recall, i.e., the $F_1$ value, were
calculated [1, 36]. Performance was assessed based on five-fold cross validation.
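As an illustration of this evaluation protocol, the sketch below computes precision, recall, and the $F_1$ value from per-fold contingency counts and averages $F_1$ over the five folds; the counts themselves are invented for the example.

    def precision_recall_f1(tp, fp, fn):
        # Precision = tp/(tp+fp), recall = tp/(tp+fn), and F1 is their
        # harmonic combination: 2PR/(P+R).
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # Hypothetical (tp, fp, fn) counts for each of the five folds
    folds = [(80, 10, 15), (78, 12, 14), (82, 9, 16), (79, 11, 13), (81, 10, 12)]
    f1_per_fold = [precision_recall_f1(tp, fp, fn)[2] for tp, fp, fn in folds]
    print(sum(f1_per_fold) / len(f1_per_fold))  # mean F1 across folds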