10 Handling of Imbalanced Data in Text Classification: Category-Based Term Weights
Ying Liu, Han Tong Loh, Kamal Youcef-Toumi, and Shu Beng Tor
10.1 Introduction
Learning from imbalanced data has emerged as a new challenge to the machine
learning (ML), data mining (DM) and text mining (TM) communities. Two workshops,
held in 2000 at AAAI [17] and in 2003 at ICML [7], and a special issue of ACM
SIGKDD Explorations [8] have been dedicated to this topic. The problem has attracted
growing interest and attention among researchers and practitioners seeking solutions
for handling imbalanced data. An excellent review of the state of the art is given
by Gary Weiss [43].
The data imbalance problem often occurs in classification and clustering scenarios
when a portion of the classes possesses many more examples than the others. As
pointed out by Chawla et al. [8], when standard classification algorithms are applied
to such skewed data, they tend to be overwhelmed by the major categories and to
ignore the minor ones. There are two main reasons why such uneven distributions
arise. One is the intrinsic nature of events such as credit fraud, cancer detection,
network intrusion, earthquake prediction and so on [8]; these rare events form a
distinct category but occupy only a very small portion of the entire example space.
The other is the expense of collecting learning examples, together with legal or
privacy constraints. In our previous endeavor of building a manufacturing-centered
technical paper corpus [27, 28], the costly effort demanded for human labeling and
the diverse interests in the papers left us, naturally, with a skewed collection.
Automatic text classification (TC) has recently witnessed a booming interest,
due to the increased availability of documents in digital form and the ensuing need
to organize them [40]. In TC tasks, given that most test collections are composed
of documents belonging to multiple classes, performance is usually reported in
terms of micro-averaged and macro-averaged scores [40, 46]. Macro-averaging gives
equal weight to the score generated for each individual category, whereas
micro-averaging tends to be dominated by the categories with more positive training
instances. Because many of the test corpora used in TC are either naturally skewed
or artificially imbalanced, especially in the binary and so-called 'one-against-all'
settings, classifiers often perform far less than satisfactorily on the minor
categories.
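To make the difference between the two averages concrete, the following minimal
Python sketch (the category names and contingency counts are hypothetical, chosen
only for illustration) computes micro- and macro-averaged F1 from per-category true
positive, false positive and false negative counts:

# Sketch: micro- vs. macro-averaged F1 from per-category counts.
# The categories and counts below are hypothetical, chosen only to
# illustrate how a large category dominates the micro-average.

# Per-category contingency counts: (true positives, false positives, false negatives)
counts = {
    "major_category": (900, 50, 50),   # many positive instances
    "minor_category": (5, 10, 20),     # few positive instances
}

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Macro-averaging: compute F1 per category, then take the unweighted mean,
# so each category contributes equally regardless of its size.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro-averaging: pool the counts over all categories first, then compute
# a single F1, so large categories dominate the result.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")

On these hypothetical counts the macro-average is pulled down by the minor category
(about 0.60), while the micro-average (about 0.93) stays close to the major
category's score, which is why macro-averaged results are more revealing on skewed
corpora.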