categories [24, 40, 46]. As a result, micro-averaging usually yields much better results
than macro-averaging.
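The contrast between the two averaging schemes can be made concrete with a small sketch (pure Python; all per-category counts below are invented for illustration):

```python
# Illustrative sketch: micro- vs. macro-averaged F1 over imbalanced categories.
# The (tp, fp, fn) counts per category are invented for illustration only.

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# One common category where the classifier does well, two rare ones where it
# does poorly -- the typical situation described in the text.
categories = {
    "common": (900, 50, 50),
    "rare_1": (2, 5, 18),
    "rare_2": (1, 4, 19),
}

# Macro-averaging: mean of per-category F1 -- every category counts equally.
macro_f1 = sum(f1(*c) for c in categories.values()) / len(categories)

# Micro-averaging: pool the counts first -- dominated by the common category.
tp = sum(c[0] for c in categories.values())
fp = sum(c[1] for c in categories.values())
fn = sum(c[2] for c in categories.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```

Because the pooled counts are dominated by the common category, the micro-averaged score stays high even though both rare categories are handled badly.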
There have been several endeavors in handling imbalanced data sets in TC.
Here, we only focus on the approaches adopted in TC and group them based on
their primary focus. The first approach is based on sampling strategy. Yang [45] has
tested two sampling methods, i.e., proportion-enforced sampling and completeness-
driven sampling. Her empirical study using the ExpNet system shows that a global
sampling strategy which favors common categories over rare categories is critical for
the success of TC based on a statistical learning approach. Without such a global
control, the global optimal performance will be compromised and the learning efficiency
can be substantially decreased. Nickerson et al. [33] provide a guided sampling
approach based on a clustering algorithm called Principal Direction Divisive Parti-
tioning to deal with the between-class imbalance problem. It has shown improvement
over existing methods of equalizing class imbalances, especially when there is a large
between-class imbalance together with severe imbalance in the relative densities of
the subcomponents of each class. Liu's recent efforts [25] in testing different sam-
pling strategies, i.e., under-sampling and over-sampling, and several classification
algorithms, i.e., Naïve Bayes, k-Nearest Neighbors (kNN) and Support Vector Machines
(SVMs), improve the understanding of interactions among sampling method,
classifier and performance measurement.
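The two sampling strategies tested by Liu can be sketched in a few lines of pure Python (the toy two-class data set and its labels are invented for illustration):

```python
import random

# Illustrative sketch of random under- and over-sampling on a toy
# imbalanced two-class collection: 95 "common" documents, 5 "rare" ones.
random.seed(0)
data = [("doc%d" % i, "common") for i in range(95)] + \
       [("doc%d" % i, "rare") for i in range(95, 100)]

common = [x for x in data if x[1] == "common"]
rare = [x for x in data if x[1] == "rare"]

# Under-sampling: discard majority-class examples until the classes balance.
under = random.sample(common, len(rare)) + rare

# Over-sampling: duplicate minority-class examples (sampling with
# replacement) until they match the majority class in size.
over = common + [random.choice(rare) for _ in range(len(common))]

print(len(under), len(over))  # 10 190
```

Under-sampling trades training data for balance, while over-sampling keeps all majority examples at the cost of repeated minority examples; which trade-off wins depends on the classifier and evaluation measure, as the studies above report.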
The second major effort emphasizes cost sensitive learning [10, 12, 44]. In many
real scenarios, such as risk management and medical diagnosis, wrong decisions
are usually associated with very different costs. A wrong prediction of the nonexistence
of cancer, i.e., false negative, may lead to death, while the wrong prediction of
cancer existence, i.e., false positive, only results in unnecessary anxiety and medical
tests. In view of this, assigning different cost factors to false negatives and false
positives will lead to better performance with respect to positive (rare) classes [8].
Brank et al. [4] have reported their work on cost sensitive learning using SVMs on
TC. They obtain better results with methods that directly modify the score thresh-
old. They further propose a method based on the conditional class distributions for
SVM scores that works well when only very few training examples are available.
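The general idea of modifying the score threshold under asymmetric costs can be sketched as follows. This is not Brank et al.'s exact procedure; the scores, labels, and cost factors are invented for illustration:

```python
# Cost-sensitive threshold moving: instead of classifying at score > 0,
# choose the threshold that minimizes total misclassification cost on a
# held-out set. All numbers below are invented for illustration.

COST_FN = 10.0  # missing a positive (rare-class) example is expensive
COST_FP = 1.0   # a false alarm is comparatively cheap

# (classifier score, true label) pairs for a small validation set.
scored = [(-2.0, 0), (-1.2, 0), (-0.6, 1), (-0.3, 0),
          (0.1, 0), (0.4, 1), (1.3, 1), (2.2, 1)]

def total_cost(threshold):
    """Total misclassification cost when predicting positive above threshold."""
    cost = 0.0
    for score, label in scored:
        pred = 1 if score > threshold else 0
        if label == 1 and pred == 0:
            cost += COST_FN  # false negative
        elif label == 0 and pred == 1:
            cost += COST_FP  # false positive
    return cost

# Candidate thresholds: midpoints between consecutive sorted scores.
scores = sorted(s for s, _ in scored)
candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
best = min(candidates, key=total_cost)
print(f"cost at 0.0 = {total_cost(0.0)}, best threshold = {best}, "
      f"cost = {total_cost(best)}")
```

With a false negative ten times as costly as a false positive, the minimizing threshold moves well below zero: the classifier accepts extra false alarms to recover the expensive rare-class example.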
The recognition based approach, i.e., one-class learning, has provided another
class of solutions [18]. One-class learning aims to create the decision model based
on the examples of the target category alone, which is different from the typical
discriminative approach, i.e., the two-class setting. Manevitz and Yousef [30] have
applied one-class SVMs on TC. Raskutti and Kowalczyk [35] claim that one-class
learning is particularly helpful when the data are extremely skewed, of very high
dimensionality, and composed of many irrelevant features.
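The recognition-based setting can be illustrated with a deliberately simple stand-in for a one-class SVM: learn a region from positive examples only, then accept or reject new points. The centroid-plus-radius rule and the 2-D vectors below are invented for illustration:

```python
import math

# Recognition-based (one-class) learning in miniature: the model is built
# from target-category examples alone, with no negative examples at all.
positives = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (1.0, 1.0)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "Train": centroid of the target category; radius = largest training distance.
dim = len(positives[0])
centroid = tuple(sum(p[i] for p in positives) / len(positives)
                 for i in range(dim))
radius = max(dist(p, centroid) for p in positives)

def is_target(x):
    """Accept x iff it falls inside the learned region."""
    return dist(x, centroid) <= radius

print(is_target((1.0, 1.1)), is_target((5.0, 5.0)))
```

A real one-class SVM learns a far more flexible boundary in a kernel-induced space, but the contract is the same: the decision model never sees examples from other categories.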
Feature selection is often considered an important step in reducing the high
dimensionality of the feature space in TC and many other problems in image pro-
cessing and bioinformatics. However, its unique contribution in identifying the most
salient features to boost the performance of minor categories has not been stressed
until some recent work [31]. Yang [47] has given a detailed evaluation of several
feature selection schemes. We noted the marked difference between micro-averaged and
macro-averaged values due to the poor performance over rare categories. Forman
[14] has done a very comprehensive study of various schemes for TC on a wide range
of commonly used test corpora. He has recommended the best pair among different
combinations of selection schemes and evaluation measures. The recent efforts from
Zheng et al. [50] advance the understanding of feature selection in TC. They show