c_i. The higher this ratio, the more closely the term t_k is related to category c_i. A similar analysis can be made with respect to A/C: that ratio reflects the expectation that a term is deemed more relevant if it occurs in a larger portion of the documents of category c_i than other terms do.
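To make the two ratios concrete, the sketch below computes them from a toy corpus, assuming the usual contingency-table convention for the information elements of Table 10.2: for a term t_k and category c_i, A is the number of documents in c_i containing t_k, B the number outside c_i containing t_k, and C the number in c_i lacking t_k. The documents and category names are invented for illustration.

```python
# Sketch of the A/B and A/C ratios, assuming the contingency-table
# convention: A = docs in c_i with t_k, B = docs outside c_i with t_k,
# C = docs in c_i without t_k. The toy corpus below is illustrative only.

def contingency(docs, labels, term, category):
    A = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    B = sum(1 for d, y in zip(docs, labels) if y != category and term in d)
    C = sum(1 for d, y in zip(docs, labels) if y == category and term not in d)
    return A, B, C

docs = [
    {"laser", "cutting", "steel"},   # manufacturing
    {"laser", "welding"},            # manufacturing
    {"milling", "steel"},            # manufacturing
    {"market", "sales"},             # business
    {"laser", "pointer", "sales"},   # business
]
labels = ["manu", "manu", "manu", "biz", "biz"]

A, B, C = contingency(docs, labels, "laser", "manu")
ratio_ab = A / B if B else float("inf")  # concentration of t_k in c_i vs. other categories
ratio_ac = A / C if C else float("inf")  # coverage of t_k within c_i
```

Here "laser" occurs in two of the three manufacturing documents and one business document, so both ratios equal 2.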
Table 10.4. The different combinations of Category-Based Term Weights (CBTW); the mathematical forms are expressed in terms of the information elements shown in Table 10.2

CBTW 1: log(1 + (A/B)(A/C))           CBTW 5: log(1 + ((A+B)/B)((A+C)/C))
CBTW 2: log(1 + A/(B+C))              CBTW 6: log(1 + (A+B)/B + (A+C)/C)
CBTW 3: log(1 + A/B) log(1 + A/C)     CBTW 7: log(1 + (A+B)/B) log(1 + (A+C)/C)
CBTW 4: log[(1 + A/B)(1 + A/C)]       CBTW 8: log[(1 + (A+B)/B)(1 + (A+C)/C)]
Since the computation of both A/B and A/C is intrinsically connected with category membership, we propose a new term weighting scheme, called Category-Based Term Weights (CBTW), to replace the idf part of the classic tfidf weighting scheme; feature selection, a regular step in TC, is therefore skipped in our experiments. Considering the probability foundation of A/B and A/C and the possibility of combining them, the most immediate choice is to take the product of the two ratios; the resulting schemes are named CBTWn in Table 10.4, where n = 1, 3, 5, 7. We also include four further variants, named CBTWn with n = 2, 4, 6, 8.
Because it is not clear in advance which combination delivers the best performance, all eight are evaluated in the benchmarking experiments reported in Section 10.6.
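As a minimal illustration of the most immediate combination, the product of the two ratios (CBTW 1), the sketch below uses it in place of the idf factor of tfidf. The add-one smoothing of B and C is our own assumption to avoid division by zero, not part of the scheme itself.

```python
import math

def cbtw1(A, B, C, smooth=1.0):
    # CBTW 1: log(1 + A/B * A/C), the product of the two category-based
    # ratios. The additive `smooth` guards against B = 0 or C = 0; this
    # smoothing is an assumption of the sketch, not of the original scheme.
    return math.log(1.0 + (A / (B + smooth)) * (A / (C + smooth)))

def term_weight(tf, A, B, C):
    # tf * CBTW1 replaces the classic tf * idf term weight.
    return tf * cbtw1(A, B, C)

# Example: a term occurring 3 times in a document, with A = 20, B = 2, C = 5.
w = term_weight(3, 20, 2, 5)
```

As expected, the weight grows as the term becomes more concentrated in, and more typical of, the target category (larger A relative to B and C).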
10.5 Experimental Setup
Two data sets were tested in our experiments: MCV1 and Reuters-21578. MCV1 is an archive of 1434 English-language manufacturing-related engineering papers gathered by courtesy of the Society of Manufacturing Engineers (SME). It comprises all engineering technical papers published by SME from 1998 to 2000. All documents were manually classified [27, 28]. There are a total of 18 major categories in MCV1. Figure 10.1 gives the class distribution in MCV1. Reuters-21578 is a widely used benchmarking collection [40]. We followed Sun's approach [41] in generating the category information. Figure 10.2 gives the class distribution of the Reuters dataset used in our experiment. Unlike Sun [41], we did not randomly sample negative examples from categories not belonging to any of the categories in our dataset; instead, we treated every example not from the target category as a negative.
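The negative-example construction described above, where every document whose label differs from the target category counts as a negative, amounts to a standard one-vs-rest labeling. A minimal sketch, with hypothetical category names:

```python
def one_vs_rest_labels(labels, target):
    # Positives are the documents of the target category; every other
    # document in the dataset is a negative -- no sampling from outside
    # categories is involved.
    return [1 if y == target else -1 for y in labels]

# Hypothetical category names for illustration.
labels = ["welding", "casting", "welding", "machining"]
y = one_vs_rest_labels(labels, "welding")  # [1, -1, 1, -1]
```

One such binary labeling is built per category, and a separate classifier is trained on each.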
We compared our weighting schemes experimentally with a number of other well-established weighting schemes, e.g., TFIDF, 'ltc,' and normalized 'ltc,' on MCV1 and Reuters-21578, using SVM as the classification algorithm. We also carried out benchmarking experiments between our schemes and many other feature selection methods, e.g., chi-square (ChiS), correlation coefficient (CC), odds ratio (OddsR),