tf · (feature value) (TFFV) to replace classic tfidf-based approaches. We then
propose Category-Based Term Weights (CBTWs), which directly make use of two
critical information ratios, as a new way to compute the feature value. These two
ratios are deemed to carry the most salient information about the category
membership of terms, and their computation imposes no extra cost compared to
the conventional feature selection methods. Our experimental study and extensive
comparisons based on two imbalanced data sets, MCV1 and Reuters-21578, show
the merits of TFFV-based approaches and their ability to handle imbalanced data.
Among the various TFFVs, CBTW1 offers the best overall performance on both data
sets. Our approach thus provides an effective option for improving TC performance on
imbalanced data. Furthermore, since CBTWs are derived from an understanding of
feature selection, they can also be viewed as new feature selection schemes to reflect
the relevance of terms with respect to thematic categories. Their joint application
with other algorithms in TC, e.g., Naïve Bayes, k-Nearest Neighbors, and Artificial
Neural Networks, where feature selection is usually performed, needs further
exploration.
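As a minimal sketch of the tf · (feature value) idea summarized above, the following Python weights each term in a document by its term frequency times a per-category feature score. It assumes chi-square as the feature value, purely for illustration. The two CBTW ratios are not defined in this section, so they are not reproduced here. The function names (`chi_square`, `feature_values`, `tffv_weights`) are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from a 2x2 table.

    a: in-category documents containing the term
    b: in-category documents without the term
    c: out-of-category documents containing the term
    d: out-of-category documents without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def feature_values(docs, labels, category):
    """Score every term against one category (assumed: chi-square).

    docs is a list of token lists; labels holds one category per document.
    """
    n_pos = sum(1 for y in labels if y == category)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Document frequency: count each term once per document.
        (pos_df if y == category else neg_df).update(set(tokens))
    values = {}
    for term in set(pos_df) | set(neg_df):
        a, c = pos_df[term], neg_df[term]
        values[term] = chi_square(a, n_pos - a, c, n_neg - c)
    return values


def tffv_weights(doc_tokens, fv):
    """Weight each term of one document as tf * feature value."""
    tf = Counter(doc_tokens)
    return {term: tf[term] * fv.get(term, 0.0) for term in tf}
```

Unlike idf, the feature value is computed per category from labeled training data, so a term that occurs mostly inside a small category can still receive a large weight. Substituting another score for `chi_square` changes the scheme without touching `tffv_weights`.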
10.8 Acknowledgment
The authors would like to thank the reviewers for their valuable comments and
Ee-Peng Lim and Aixin Sun for much fruitful discussion.