categories [24, 40, 46]. As a result, micro-averaging usually yields much better results
than macro-averaging.
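The contrast between the two averaging schemes can be made concrete with a small sketch (pure Python; all per-category counts below are invented for illustration):

```python
# Illustrative sketch: micro- vs. macro-averaged F1 over imbalanced categories.
# The (tp, fp, fn) counts per category are invented for illustration only.

def f1(tp, fp, fn):
    """F1 score from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# One common category where the classifier does well, two rare ones where it
# does poorly -- the typical situation described in the text.
categories = {
    "common": (900, 50, 50),
    "rare_1": (2, 5, 18),
    "rare_2": (1, 4, 19),
}

# Macro-averaging: mean of per-category F1 -- every category counts equally.
macro_f1 = sum(f1(*c) for c in categories.values()) / len(categories)

# Micro-averaging: pool the counts first -- dominated by the common category.
tp = sum(c[0] for c in categories.values())
fp = sum(c[1] for c in categories.values())
fn = sum(c[2] for c in categories.values())
micro_f1 = f1(tp, fp, fn)

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```

Because the pooled counts are dominated by the common category, the micro-averaged score stays high even though both rare categories are handled badly.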
There have been several endeavors in handling imbalanced data sets in TC.
Here, we only focus on the approaches adopted in TC and group them based on
their primary focus. The first approach is based on sampling strategy. Yang [45] has
tested two sampling methods, i.e., proportion-enforced sampling and completeness-
driven sampling. Her empirical study using the ExpNet system shows that a global
sampling strategy which favors common categories over rare categories is critical for
the success of TC based on a statistical learning approach. Without such a global
control, the global optimal performance will be compromised and the learning efficiency
can be substantially decreased. Nickerson et al. [33] provide a guided sampling
approach based on a clustering algorithm called Principal Direction Divisive Parti-
tioning to deal with the between-class imbalance problem. It has shown improvement
over existing methods of equalizing class imbalances, especially when there is a large
between-class imbalance together with severe imbalance in the relative densities of
the subcomponents of each class. Liu's recent efforts [25] in testing different sam-
pling strategies, i.e., under-sampling and over-sampling, and several classification
algorithms, i.e., Naïve Bayes, k-Nearest Neighbors (kNN) and Support Vector Machines
(SVMs), improve the understanding of interactions among sampling method,
classifier and performance measurement.
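The two sampling strategies tested by Liu can be sketched in a few lines of pure Python (the toy two-class data set and its labels are invented for illustration):

```python
import random

# Illustrative sketch of random under- and over-sampling on a toy
# imbalanced two-class collection: 95 "common" documents, 5 "rare" ones.
random.seed(0)
data = [("doc%d" % i, "common") for i in range(95)] + \
       [("doc%d" % i, "rare") for i in range(95, 100)]

common = [x for x in data if x[1] == "common"]
rare = [x for x in data if x[1] == "rare"]

# Under-sampling: discard majority-class examples until the classes balance.
under = random.sample(common, len(rare)) + rare

# Over-sampling: duplicate minority-class examples (sampling with
# replacement) until they match the majority class in size.
over = common + [random.choice(rare) for _ in range(len(common))]

print(len(under), len(over))  # 10 190
```

Under-sampling trades training data for balance, while over-sampling keeps all majority examples at the cost of repeated minority examples; which trade-off wins depends on the classifier and evaluation measure, as the studies above report.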
The second major effort emphasizes cost sensitive learning [10, 12, 44]. In many
real scenarios, such as risk management and medical diagnosis, wrong decisions
are usually associated with very different costs. A wrong prediction of the nonexistence
of cancer, i.e., false negative, may lead to death, while the wrong prediction of
cancer existence, i.e., false positive, only results in unnecessary anxiety and medical
tests. In view of this, assigning different cost factors to false negatives and false
positives will lead to better performance with respect to positive (rare) classes [8].
Brank et al. [4] have reported their work on cost sensitive learning using SVMs on
TC. They obtain better results with methods that directly modify the score thresh-
old. They further propose a method based on the conditional class distributions for
SVM scores that works well when only very few training examples are available.
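The general idea of modifying the score threshold under asymmetric costs can be sketched as follows. This is not Brank et al.'s exact procedure; the scores, labels, and cost factors are invented for illustration:

```python
# Cost-sensitive threshold moving: instead of classifying at score > 0,
# choose the threshold that minimizes total misclassification cost on a
# held-out set. All numbers below are invented for illustration.

COST_FN = 10.0  # missing a positive (rare-class) example is expensive
COST_FP = 1.0   # a false alarm is comparatively cheap

# (classifier score, true label) pairs for a small validation set.
scored = [(-2.0, 0), (-1.2, 0), (-0.6, 1), (-0.3, 0),
          (0.1, 0), (0.4, 1), (1.3, 1), (2.2, 1)]

def total_cost(threshold):
    """Total misclassification cost when predicting positive above threshold."""
    cost = 0.0
    for score, label in scored:
        pred = 1 if score > threshold else 0
        if label == 1 and pred == 0:
            cost += COST_FN  # false negative
        elif label == 0 and pred == 1:
            cost += COST_FP  # false positive
    return cost

# Candidate thresholds: midpoints between consecutive sorted scores.
scores = sorted(s for s, _ in scored)
candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
best = min(candidates, key=total_cost)
print(f"cost at 0.0 = {total_cost(0.0)}, best threshold = {best}, "
      f"cost = {total_cost(best)}")
```

With a false negative ten times as costly as a false positive, the minimizing threshold moves well below zero: the classifier accepts extra false alarms to recover the expensive rare-class example.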
The recognition based approach, i.e., one-class learning, has provided another
class of solutions [18]. One-class learning aims to create the decision model based
on the examples of the target category alone, which is different from the typical
discriminative approach, i.e., the two-class setting. Manevitz and Yousef [30] have
applied one-class SVMs on TC. Raskutti and Kowalczyk [35] claim that one-class
learning is particularly helpful when the data are extremely skewed, of very high
dimensionality, and composed of many irrelevant features.
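The recognition-based setting can be illustrated with a deliberately simple stand-in for a one-class SVM: learn a region from positive examples only, then accept or reject new points. The centroid-plus-radius rule and the 2-D vectors below are invented for illustration:

```python
import math

# Recognition-based (one-class) learning in miniature: the model is built
# from target-category examples alone, with no negative examples at all.
positives = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (1.0, 1.0)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "Train": centroid of the target category; radius = largest training distance.
dim = len(positives[0])
centroid = tuple(sum(p[i] for p in positives) / len(positives)
                 for i in range(dim))
radius = max(dist(p, centroid) for p in positives)

def is_target(x):
    """Accept x iff it falls inside the learned region."""
    return dist(x, centroid) <= radius

print(is_target((1.0, 1.1)), is_target((5.0, 5.0)))
```

A real one-class SVM learns a far more flexible boundary in a kernel-induced space, but the contract is the same: the decision model never sees examples from other categories.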
Feature selection is often considered an important step in reducing the high
dimensionality of the feature space in TC and many other problems in image pro-
cessing and bioinformatics. However, its unique contribution in identifying the most
salient features to boost the performance of minor categories has not been stressed
until some recent work [31]. Yang [47] has given a detailed evaluation of several
feature selection schemes. We noted the marked difference between micro-averaged and
macro-averaged values due to the poor performance over rare categories. Forman
[14] has done a very comprehensive study of various schemes for TC on a wide range
of commonly used test corpora. He has recommended the best pair among different
combinations of selection schemes and evaluation measures. The recent efforts from
Zheng et al. [50] advance the understanding of feature selection in TC. They show