Handling of Imbalanced Data in Text Classification: Category-Based Term Weights - Natural Language Processing and Text Mining


tf · (feature value) (TFFV) to replace classic tfidf-based approaches. We then
propose Category-Based Term Weights (CBTWs), which directly make use of two
critical information ratios, as a new way to compute the feature value. These two
ratios are deemed to possess the most salient information about the category
membership of terms, and their computation does not impose any extra cost compared to
the conventional feature selection methods. Our experimental study and extensive
comparisons based on two imbalanced data sets, MCV1 and Reuters-21578, show
the merits of TFFV-based approaches and their ability to handle imbalanced data.
Among the various TFFVs, CBTW1 offers the best overall performance in both data
sets. Our approach has provided an effective choice to improve TC performance for
imbalanced data. Furthermore, since CBTWs are derived from the understanding of
feature selection, they can also be viewed as new feature selection schemes to reflect
the relevance of terms with respect to thematic categories. Their joint application
with other algorithms in TC, e.g., Naïve Bayes, k-Nearest Neighbors and Artificial
Neural Networks, where feature selection is usually performed, needs further
exploration.
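To make the tf · (feature value) idea concrete, the following is a minimal sketch of how a category-specific weight could replace idf. It is an illustration only. The function names (contingency_counts, feature_value, tffv_vector), the smoothing constant, and in particular the formula inside feature_value are assumptions made for this example; the passage above does not define the two ratios, so the sketch simply uses the per-term document counts A (in-category documents containing the term), B (in-category documents without it) and C (out-of-category documents containing it) and combines A/B and A/C in one plausible way. It should not be read as the chapter's definition of any CBTW.

```python
import math
from collections import Counter


def contingency_counts(docs, labels, category):
    """Count, per term, the in-category (A) and out-of-category (C)
    documents that contain it, plus the number of in-category documents."""
    a_counts, c_counts = Counter(), Counter()
    n_pos = 0
    for tokens, label in zip(docs, labels):
        in_cat = label == category
        n_pos += int(in_cat)
        for term in set(tokens):  # document frequency, not term frequency
            (a_counts if in_cat else c_counts)[term] += 1
    return a_counts, c_counts, n_pos


def feature_value(a, b, c, smooth=1.0):
    """Hypothetical feature value built from the ratios A/B and A/C.
    The log and the additive smoothing are assumptions of this sketch."""
    return math.log(1.0 + (a / (b + smooth)) * (a / (c + smooth)))


def tffv_vector(tokens, a_counts, c_counts, n_pos):
    """Weight each term of one document as tf * feature_value."""
    vec = {}
    for term, tf in Counter(tokens).items():
        a = a_counts[term]   # in-category docs containing the term
        b = n_pos - a        # in-category docs without the term
        c = c_counts[term]   # out-of-category docs containing the term
        vec[term] = tf * feature_value(a, b, c)
    return vec


# Toy corpus (invented for illustration).
docs = [["oil", "price"], ["oil", "export"],
        ["wheat", "price"], ["wheat", "crop"]]
labels = ["crude", "crude", "grain", "grain"]

a_counts, c_counts, n_pos = contingency_counts(docs, labels, "crude")
vec = tffv_vector(["oil", "oil", "price", "wheat"], a_counts, c_counts, n_pos)
print(vec)
```

In this toy run, "oil" occurs only in the target category and receives the largest weight, "price" is spread evenly across both categories and receives a small weight, and "wheat" never occurs in the target category and is weighted zero. Because the weight depends on the category, a document gets a different vector for each binary (one-vs-rest) classifier, which is where this kind of scheme departs from a single corpus-wide idf.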

10.8 Acknowledgment

The authors would like to thank the reviewers for their valuable comments and
Ee-Peng Lim and Aixin Sun for much fruitful discussion.

