worse than the other two, and the odds ratio does not perform as well as the other TFFVs. Global-based classic schemes, i.e., TFIDF, 'ltc', and the normalized 'ltc' form, do not work well for either MCV1 or Reuters-21578. A close look at their performance reveals that classifiers built for minor categories, e.g., composites manufacturing, electronics manufacturing, and so on, do not produce satisfactory results, and this has largely affected the overall performance. Among TFFVs, odds ratio does not work very well. This is a surprise, since the odds ratio is cited as one of the leading feature selection methods for text classification in the literature [37, 40]. This implies that it is always worthwhile to reassess the strength of a term selection method on a new dataset, even if it has tended to perform well in the past.
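For reference, the odds ratio score discussed above is usually computed from the conditional probabilities of a term occurring inside and outside a category. The sketch below uses the standard formulation found in the feature-selection literature; the exact variant used in our experiments may differ, and the function name and smoothing constant are illustrative assumptions.

```python
import math

def odds_ratio(tp, fp, n_pos, n_neg, eps=1e-6):
    """Standard odds-ratio feature score (illustrative; the chapter's exact
    variant may differ).
    tp = in-category documents containing the term,
    fp = out-of-category documents containing the term."""
    p = tp / max(n_pos, 1)   # P(term | category)
    q = fp / max(n_neg, 1)   # P(term | not category)
    return math.log((p * (1 - q) + eps) / ((1 - p) * q + eps))
```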
From Table 10.5 we also note that RF is essentially a simplified version of the CBTWs in which the ratio A/C, i.e., the information element C, is excluded. However, the performances obtained over MCV1 and Reuters-21578 indicate that the CBTWs are not worse than RF. In fact, our experimental results show that the effective combination of A/B and A/C, i.e., CBTW1 and CBTW3, can lead to better performance than RF. This demonstrates the practical value of A/C and supports our aforementioned conjectures about how terms can be further distinguished.
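To make the relationship between RF and the CBTWs concrete, the sketch below shows how such category-based factors might be computed from a per-category contingency table and combined with term frequency. The reading of A, B, and C (in-category documents containing the term, out-of-category documents containing the term, and in-category documents not containing the term) and the log-based combination are assumptions made for illustration; the exact CBTW formulas are those defined earlier in the chapter.

```python
import math

def contingency(term, docs, labels, category):
    """Assumed notation: A = in-category docs containing the term,
    B = out-of-category docs containing the term,
    C = in-category docs not containing the term."""
    A = B = C = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        if label == category:
            if has_term:
                A += 1
            else:
                C += 1
        elif has_term:
            B += 1
    return A, B, C

def rf_style_factor(A, B):
    # Relevance-frequency-style factor: uses only the A/B ratio (assumed form).
    return math.log2(2 + A / max(1, B))

def cbtw_style_factor(A, B, C):
    # CBTW-style factor: combines A/B with A/C (illustrative combination).
    return math.log2(2 + A / max(1, B)) * math.log2(2 + A / max(1, C))

def weight_term(tf, A, B, C, scheme="cbtw"):
    # TFFV pattern: term frequency times a category-based factor.
    factor = cbtw_style_factor(A, B, C) if scheme == "cbtw" else rf_style_factor(A, B)
    return tf * factor
```

Under this reading, dropping the A/C factor reduces the CBTW-style weight to the RF-style one, which is exactly the simplification discussed above.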
10.6.2 Gains for Minor Categories
As shown in Figures 10.1 and 10.2, both MCV1 and Reuters-21578 are skewed data sets. MCV1 has 18 categories, with one major category occupying up to 25% of the whole population of supporting documents; six categories each own only around 1% of MCV1, and 11 categories fall below the average of 5.5% that would hold if MCV1 were evenly distributed. The same is true of the Reuters-21578 dataset. It has 13 categories, and grain and crude, the two major categories, share around half of the population; eight categories in total fall below the average. The previous literature did not report success on these minor categories [40, 41, 46].
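The skew described above can be quantified directly from the label distribution. The helper below is an illustrative sketch (not part of the original experiments) that flags categories whose share falls below the even-split threshold of one over the number of categories, i.e., roughly 5.5% for MCV1's 18 categories.

```python
from collections import Counter

def category_shares(labels):
    """Per-category document share and the 'minor' categories falling
    below the even-split threshold (1 / number of categories)."""
    counts = Counter(labels)
    total = sum(counts.values())
    threshold = 1.0 / len(counts)        # ~5.5% for MCV1's 18 categories
    shares = {c: n / total for c, n in counts.items()}
    minor = sorted(c for c, s in shares.items() if s < threshold)
    return shares, minor
```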
Since our study shows that the TFFV schemes work better than the classic approaches, we examine why this is the case. A close analysis shows that TFFVs produce much better results on the minor categories of both MCV1 and Reuters-21578. We plot their performance in Figures 10.4 and 10.5, respectively. For all minor categories shown in both figures, we observe a sharp increase in performance when the system's weighting method switches from normalized 'ltc' to the CBTWs and from TFIDF to RF.
Based on Figure 10.3, since TFIDF often helps SVM generate the best performance among the three classic schemes and CBTW1 has the best overall performance among the 16 weighting candidates, we chose TFIDF and CBTW1 as the representatives of each group for further analysis. Both the precision and the recall of each individual category in MCV1 and Reuters-21578 are plotted in Figures 10.6 and 10.7. Looking at both figures reveals why TFFVs, and in particular CBTW1, perform better. We observe that in general CBTW1 falls slightly below TFIDF in terms of precision. However, CBTW1 performs far better in terms of recall and as a result surpasses TFIDF in terms of F1 values. While the averaged precision of TFIDF on MCV1 is 0.8355, about 5% higher than that of CBTW1, the averaged recall of CBTW1 is 0.7443, far superior to TFIDF's 0.6006.
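As a quick check of the numbers quoted above, F1 is the harmonic mean of precision and recall, so TFIDF's MCV1 figures give an F1 of roughly 0.70. If CBTW1's precision is taken to be about 5% below TFIDF's (an assumption about how the gap is measured), its higher recall still yields an F1 near 0.77, consistent with CBTW1 surpassing TFIDF on F1.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.8355, 0.6006), 3))         # TFIDF on MCV1 -> 0.699
print(round(f1(0.8355 * 0.95, 0.7443), 3))  # CBTW1, assuming ~5% lower precision -> 0.768
```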
The case with Reuters-21578 is even more impressive. While the averaged precision of TFIDF is 0.8982, which is only 1.8%