10 Handling of Imbalanced Data in Text Classification: Category-Based Term Weights
Ying Liu, Han Tong Loh, Kamal Youcef-Toumi, and Shu Beng Tor
10.1 Introduction
Learning from imbalanced data has emerged as a new challenge to the machine
learning (ML), data mining (DM) and text mining (TM) communities. Two workshops,
held in 2000 at AAAI [17] and in 2003 at ICML [7], and a special issue of ACM
SIGKDD Explorations [8] have been dedicated to this topic. The problem has attracted
growing interest and attention among researchers and practitioners seeking solutions
for handling imbalanced data. An excellent review of the state of the art is given
by Gary Weiss [43].
The data imbalance problem often occurs in classification and clustering scenarios
when a portion of the classes possesses many more examples than the others. As
pointed out by Chawla et al. [8], when standard classification algorithms are applied
to such skewed data, they tend to be overwhelmed by the major categories and to
ignore the minor ones. There are two main reasons why such uneven distributions
arise. One is the intrinsic nature of events such as credit fraud, cancer detection,
network intrusion, earthquake prediction and so on [8]; these rare events form a
distinct category but occupy only a very small portion of the entire example space.
The other is the expense of collecting learning examples, together with legal or
privacy constraints. In our previous endeavor of building a manufacturing-centered
technical paper corpus [27, 28], the costly effort demanded for human labeling and
the diverse interests in the papers left us, naturally, with a skewed collection.
Automatic text classification (TC) has recently witnessed a booming interest,
due to the increased availability of documents in digital form and the ensuing need
to organize them [40]. In TC tasks, given that most test collections are composed
of documents belonging to multiple classes, performance is usually reported in
terms of micro-averaged and macro-averaged scores [40, 46]. Macro-averaging gives
equal weight to the score generated for each individual category, whereas
micro-averaging tends to be dominated by the categories with more positive training
instances. Because many of the test corpora used in TC are either naturally skewed
or artificially imbalanced, especially in the binary and so-called 'one-against-all'
settings, classifiers often perform far less than satisfactorily on the minor
categories.
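To make the difference between the two averages concrete, the following minimal
Python sketch (the category names and contingency counts are hypothetical, chosen
only for illustration) computes micro- and macro-averaged F1 from per-category true
positive, false positive and false negative counts:

# Sketch: micro- vs. macro-averaged F1 from per-category counts.
# The categories and counts below are hypothetical, chosen only to
# illustrate how a large category dominates the micro-average.

# Per-category contingency counts: (true positives, false positives, false negatives)
counts = {
    "major_category": (900, 50, 50),   # many positive instances
    "minor_category": (5, 10, 20),     # few positive instances
}

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Macro-averaging: compute F1 per category, then take the unweighted mean,
# so each category contributes equally regardless of its size.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro-averaging: pool the counts over all categories first, then compute
# a single F1, so large categories dominate the result.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")

On these hypothetical counts the macro-average is pulled down by the minor category
(about 0.60), while the micro-average (about 0.93) stays close to the major
category's score, which is why macro-averaged results are more revealing on skewed
corpora.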