information gain (IG) and relevance frequency (RF) [30], by replacing the idf term with the feature-selection value in the classic tfidf weighting scheme. These schemes are therefore largely formulated in the form tf · (feature value) (TFFV). Table 10.5 shows all 16 weighting schemes tested in our experiments and their mathematical formulations. Please note that the majority of TFFV schemes are composed of two factors: the normalized term frequency tf(t_i, d_j)/max[tf(d_j)], and the term's feature value, e.g., √N (AD − BC) / √((A+C)(B+D)(A+B)(C+D)) in the correlation coefficient scheme. Here tf(t_i, d_j) is the frequency of term t_i in document d_j, and max[tf(d_j)] is the maximum frequency of a term in document d_j. The only different ones are TFIDF weighting, the 'ltc' form, and the normalized 'ltc' form, as specified in Table 10.5.
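For concreteness, the two-factor TFFV composition can be sketched in Python. The function names and the toy contingency counts (A, B, C, D) below are our own illustration, not code from the chapter:

```python
import math

def ntf(tf_t_d, max_tf_d):
    # Normalized term frequency: tf(t_i, d_j) / max[tf(d_j)]
    return tf_t_d / max_tf_d

def correlation_coefficient(A, B, C, D):
    # Feature value of the correlation coefficient scheme:
    # sqrt(N) * (AD - BC) / sqrt((A+C)(B+D)(A+B)(C+D)), with N = A+B+C+D
    N = A + B + C + D
    denom = math.sqrt((A + C) * (B + D) * (A + B) * (C + D))
    return math.sqrt(N) * (A * D - B * C) / denom

# A TFFV weight is the product of the two factors (hypothetical counts):
weight = ntf(3, 7) * correlation_coefficient(A=40, B=10, C=20, D=30)
```

Any other feature-selection value can be swapped in for the second factor in the same way.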
Table 10.5. All weighting schemes tested in the experiments and their mathematical formulations, where the normalized term frequency ntf is defined as tf(t_i, d_j)/max[tf(d_j)], A, B, C, D are the cells of the term-category contingency table, N = A + B + C + D, and N(t_i) is the number of documents containing term t_i.

Weighting Scheme Name            Mathematical Formulation
Correlation Coef. (CC)           ntf · √N (AD − BC) / √((A+C)(B+D)(A+B)(C+D))
Chi-square (ChiS)                ntf · N (AD − BC)² / ((A+C)(B+D)(A+B)(C+D))
Information Gain (IG)            ntf · [(A/N) log(AN/((A+B)(A+C))) + (C/N) log(CN/((C+D)(A+C)))]
Odds Ratio (OddsR)               ntf · log(AD/BC)
Relevance Freq. (RF)             ntf · log(1 + (A+B)/B)
TFIDF                            ntf · log(N/N(t_i))
ltc                              tf(t_i, d_j) · log(N/N(t_i))
Normalized ltc (nltc)            tfidf_ltc(t_i, d_j) / √(Σ_k [tfidf_ltc(t_k, d_j)]²)
Cat. Based Term Wt. 1 (CBTW1)    ntf · log(1 + (A/B)(A/C))
Cat. Based Term Wt. 2 (CBTW2)    ntf · log(1 + A/B + A/C)
Cat. Based Term Wt. 3 (CBTW3)    ntf · log(1 + A/B) · log(1 + A/C)
Cat. Based Term Wt. 4 (CBTW4)    ntf · log[(1 + A/B)(1 + A/C)]
Cat. Based Term Wt. 5 (CBTW5)    ntf · log(1 + ((A+B)/B)((A+C)/C))
Cat. Based Term Wt. 6 (CBTW6)    ntf · log(1 + (A+B)/B + (A+C)/C)
Cat. Based Term Wt. 7 (CBTW7)    ntf · log(1 + (A+B)/B) · log(1 + (A+C)/C)
Cat. Based Term Wt. 8 (CBTW8)    ntf · log[(1 + (A+B)/B)(1 + (A+C)/C)]
Major standard text preprocessing steps were applied in our experiments, including stopword and punctuation removal, and stemming. However, feature selection was skipped, and all terms left after stopword and punctuation removal were kept as features. In our experiments we used the SVM implementation called SVM-Light [19, 20]. We used the linear function as its kernel, since previous work has shown that the linear kernel can deliver even better performance without tedious parameter tuning in TC [19, 21]. As for performance measurement, precision, recall, and the harmonic combination of precision and recall, i.e., the F1 value, were calculated [1, 36]. Performance was assessed based on five-fold cross-validation. Since