the merits and great potential of explicitly combining positive and negative features
in a nearly optimal fashion with respect to the imbalanced data.
Some recent work that simply adapts existing machine learning techniques, without
directly targeting the issue of class imbalance, has nevertheless shown great potential
on the data imbalance problem. Castillo and Serrano [6] and Fan et al. [13] have
reported success using ensemble approaches, e.g., voting and boosting, to handle
skewed data distributions. Challenged by real industry data with a huge number of
records and an extremely skewed distribution, Fan's work shows that the ensemble
approach is capable of improving performance on rare classes. In these approaches,
a set of weak classifiers built with various learning algorithms is trained over the
minor categories, and the final decision is reached by combining the outcomes of
the different classifiers.
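To make the idea concrete, the sketch below (ours, not the actual systems of [6]
or [13]) combines three heterogeneous weak learners by majority vote on a synthetic
9:1 skewed data set; the estimator choices, the class ratio, and the use of
scikit-learn are all illustrative assumptions.

```python
# A minimal voting-ensemble sketch for a skewed binary task.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a 9:1 class ratio to mimic an imbalanced problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Weak learners built with different algorithms, combined by majority vote.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(class_weight="balanced", max_depth=5)),
])
ensemble.fit(X_tr, y_tr)
print("minority-class F1:", f1_score(y_te, ensemble.predict(X_te)))
```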
Another promising approach, which has received less attention, falls into the
category of semi-supervised or weakly supervised learning [3, 15, 16, 23, 26, 34,
48, 49]. The basic idea is to identify more positive examples from a large pool of
unlabeled data. These approaches are especially viable when unlabeled data are
steadily available.
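The following is a minimal self-training loop, one common weakly supervised
scheme rather than any particular method from the works cited above; the
confidence threshold, the number of rounds, and the synthetic data are
illustrative assumptions.

```python
# Self-training sketch: confidently predicted positives from the
# unlabeled pool are added to the (scarce) positive training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=1)
X_l, y_l = X[:500], y[:500]   # small labeled set
X_u = X[500:]                 # large unlabeled pool (labels hidden)

clf = LogisticRegression(max_iter=1000)
for _ in range(5):                        # a few self-training rounds
    clf.fit(X_l, y_l)
    if len(X_u) == 0:
        break
    proba = clf.predict_proba(X_u)[:, 1]
    picks = proba > 0.95                  # keep only very confident positives
    if not picks.any():
        break
    X_l = np.vstack([X_l, X_u[picks]])    # grow the positive training set
    y_l = np.concatenate([y_l, np.ones(picks.sum(), dtype=int)])
    X_u = X_u[~picks]
```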
A final effort attacking the imbalance problem uses parameter tuning in k-NN [2].
The authors set k dynamically according to the data distribution, granting a
larger k to minor categories.
In this chapter, we tackle the data imbalance problem from a different angle.
We present a novel approach that assigns better weights to features from minor
categories. Inspired by the merits of feature selection, we base our approach for
identifying the most salient features of a category on the classic term weighting
scheme, i.e., tfidf, and propose several weighting factors, called Category-Based
Term Weights (CBTW), to replace the idf term in the classic tfidf form. The
experimental setup is described in Section 10.5. We evaluate and compare our
CBTWs with many other weighting forms over two skewed data sets, including
Reuters-21578. We report the experimental findings and discuss their performance
in Section 10.6. Finally, we give our conclusions in Section 10.7.
10.2 Term Weighting Scheme
TC is the process of categorizing documents into predefined thematic categories. In
its current practice, which is dominated by supervised learning, the construction of
a text classifier is often conducted in two main phases [9, 40]:
a) Document indexing - the creation of numeric representations of the documents:
   - Term selection - selecting a subset of the terms occurring in the collection
     to represent the documents more effectively, either to ease computation or
     to maximize classification effectiveness
   - Term weighting - assigning to each term a numeric value that represents its
     contribution to making a document stand out from the others
b) Classifier learning - the building of a classifier by learning from the numeric
   representations of the documents (a brief sketch of both phases follows)
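To make the two phases concrete, here is a minimal sketch assuming scikit-learn:
the vectorizer carries out document indexing (it can also perform a crude form of
term selection, e.g., via its max_features option), and a linear SVM performs
classifier learning; the toy documents and labels are invented.

```python
# Phase (a): indexing + term weighting; phase (b): classifier learning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["grain exports rose", "oil prices fell", "wheat grain harvest"]
labels = ["grain", "oil", "grain"]

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # phase (a): numeric representations
    LinearSVC(),                         # phase (b): learn the classifier
)
model.fit(docs, labels)
print(model.predict(["grain harvest up"]))
```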
In information retrieval and machine learning, term weighting has long been
formulated as term frequency · inverse document frequency, i.e., tfidf [1, 36,
38, 39]. The more popular "ltc" form [1, 38, 39] is given by

$$ w(t_k, d_j) = \frac{\mathrm{tf}(t_k, d_j) \cdot \log \frac{N}{\mathrm{df}(t_k)}}{\sqrt{\sum_{s=1}^{|T|} \left[ \mathrm{tf}(t_s, d_j) \cdot \log \frac{N}{\mathrm{df}(t_s)} \right]^2}}, $$

where tf(t_k, d_j) is the frequency of term t_k in document d_j, df(t_k) is the
number of documents containing t_k, N is the total number of documents, and |T|
is the vocabulary size; the denominator is the cosine normalization factor.
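For illustration, the sketch below (ours, in plain Python) computes these weights
exactly as in the formula above: raw term frequency times the idf factor
log(N/df), followed by cosine normalization; the helper name and the toy term
counts are our own.

```python
# Compute ltc-style tfidf weights from raw per-document term counts.
import math

def ltc_weights(doc_term_counts):
    """doc_term_counts: list of dicts mapping term -> raw frequency."""
    n_docs = len(doc_term_counts)
    # Document frequency df(t): number of documents containing term t.
    df = {}
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    weighted = []
    for counts in doc_term_counts:
        # tf(t, d) * log(N / df(t)) for every term in the document.
        w = {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
        # Cosine normalization so document length does not dominate.
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

docs = [{"cat": 3, "mat": 1}, {"cat": 1, "dog": 2}, {"dog": 4, "mat": 1}]
print(ltc_weights(docs))
```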