Of course, the benefit that additional examples of a given class will yield on test data is unknown a priori. Furthermore, the impact of a particular class's examples may vary depending on the feature values of the particular instances acquired. To cope with these issues, we can estimate this benefit via cross-validation on the training set: using sampling, we can try various class-conditional additions and compute the expected benefit of each class across that class's representatives in T , assessed on the testing folds. The earlier-mentioned utility then becomes:
\[
U(c) \;=\; \mathbb{E}_{x \in c}\!\left[\, \frac{1}{|D|} \sum_{x' \in D} \sum_{i} P_{T}(c_i \mid x')\,\mathrm{cost}(c_i \mid y') \;-\; \frac{1}{|D|} \sum_{x' \in D} \sum_{i} P_{T \cup c}(c_i \mid x')\,\mathrm{cost}(c_i \mid y') \,\right],
\]
where $P_T(c_i \mid x')$ is the posterior estimate of a model trained on $T$, $P_{T \cup c}(c_i \mid x')$ is that of a model trained on $T$ augmented with the sampled class-$c$ example $x$, and $\mathrm{cost}(c_i \mid y')$ is the cost of predicting class $c_i$ for an instance whose true label is $y'$. $U(c)$ is thus the expected reduction in misclassification cost over the evaluation set $D$.
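To make this concrete, here is a minimal Monte Carlo sketch of the estimate, assuming a scikit-learn-style classifier with integer class labels 0..K-1 (so that predict_proba columns align with the rows of a cost matrix) and using existing class-c training points as stand-ins for newly acquired examples; the function names and the cost-matrix convention cost[i, y] (the cost of predicting class c_i when the true class is y) are illustrative assumptions, not the chapter's notation.

```python
import numpy as np
from sklearn.base import clone

def mean_misclassification_cost(model, X_eval, y_eval, cost):
    """(1/|D|) * sum_{x' in D} sum_i P(c_i | x') * cost(c_i | y')."""
    proba = model.predict_proba(X_eval)       # shape (|D|, n_classes)
    # cost[:, y_eval].T has shape (|D|, n_classes); entry [n, i] is
    # cost(c_i | y_n), the cost of predicting c_i when the truth is y_n.
    return (proba * cost[:, y_eval].T).sum(axis=1).mean()

def class_utility(c, X_train, y_train, X_val, y_val, cost,
                  base_model, n_samples=10, seed=None):
    """Monte Carlo estimate of U(c): the expected reduction in cost on
    the validation fold when one class-c example is added to T."""
    rng = np.random.default_rng(seed)
    before = mean_misclassification_cost(
        clone(base_model).fit(X_train, y_train), X_val, y_val, cost)
    members = np.flatnonzero(y_train == c)    # stand-ins for class-c draws
    deltas = []
    for idx in rng.choice(members, size=n_samples, replace=True):
        X_aug = np.vstack([X_train, X_train[idx]])
        y_aug = np.append(y_train, c)
        after = mean_misclassification_cost(
            clone(base_model).fit(X_aug, y_aug), X_val, y_val, cost)
        deltas.append(before - after)         # positive means cost went down
    return float(np.mean(deltas))
```

In the cross-validated version described above, (X_val, y_val) would be each testing fold in turn, with the resulting estimates averaged across folds.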
Note that it is often preferable to add examples in batches. In this case, we may wish to sample from the classes in proportion to their respective utilities:
\[
p_t(c) \;=\; \frac{U(c)}{\sum_{c'} U(c')}.
\]
Further, diverse class-conditional acquisition costs can be incorporated by utilizing $U(c)/\omega_c$ in place of $U(c)$, where $\omega_c$ is the (expected) cost of acquiring the feature vector of an example in class $c$.
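A small sketch of this batch-sampling step follows; the dictionary-based interface and the clipping of negative utility estimates to zero (so the proportions form a valid distribution) are choices made here for illustration.

```python
import numpy as np

def sample_batch_classes(utilities, batch_size, acquisition_costs=None, seed=None):
    """Draw a batch of class labels in proportion to U(c), optionally
    using the cost-adjusted utility U(c)/omega_c instead."""
    rng = np.random.default_rng(seed)
    classes = sorted(utilities)
    u = np.array([utilities[c] for c in classes], dtype=float)
    if acquisition_costs is not None:
        u /= np.array([acquisition_costs[c] for c in classes], dtype=float)
    u = np.clip(u, 0.0, None)        # drop classes with negative utility
    p = u / u.sum()                  # assumes at least one positive utility
    return rng.choice(classes, size=batch_size, p=p)
```

For example, sample_batch_classes({0: 0.03, 1: 0.12}, batch_size=10, acquisition_costs={0: 1.0, 1: 4.0}) gives both classes a cost-discounted utility of 0.03, so they would be sampled equally often.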
6.8.1.3 Alternative Approaches to ACS

In addition to uncertainty-based and utility-based techniques, there are several alternative techniques for performing ACS. Empirical results show that, barring any domain-specific information, a balanced class distribution tends to offer reasonable AUC on test data when collecting examples for a training set of size n [43, 47]. Motivated by this, a reasonable baseline approach to ACS is simply to select classes in balanced proportion.
Search strategies may alternatively be employed to reveal the most effective class ratio at each epoch. Using a nested cross-validation on the training set, the space of class ratios can be explored, with the most favorable ratio being used at each epoch. Note that it is not feasible to explore all possible class ratios at every epoch without eventually spending too much of the acquisition budget on one class or another. Thus, as the training set approaches its final size n , we can narrow the range of candidate class ratios, on the assumption that a problem-optimal class ratio exists and will become more apparent as more data are obtained [43].
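The sketch below illustrates one epoch of such a search for a binary problem. For brevity it replaces the nested cross-validation with a single cross-validated AUC estimate per candidate ratio, and it assumes 0/1 labels, a fixed per-epoch budget, and candidate ratios that leave enough examples of both classes for 5-fold evaluation; all names here are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def best_class_ratio(X, y, base_model, candidate_ratios, budget, seed=None):
    """Resample the current data to each candidate positive-class ratio
    and keep the ratio with the best cross-validated AUC."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    best_ratio, best_auc = None, -np.inf
    for r in candidate_ratios:
        n_pos = min(int(round(r * budget)), len(pos))
        n_neg = min(budget - n_pos, len(neg))
        idx = np.concatenate([rng.choice(pos, n_pos, replace=False),
                              rng.choice(neg, n_neg, replace=False)])
        auc = cross_val_score(clone(base_model), X[idx], y[idx],
                              cv=5, scoring="roc_auc").mean()
        if auc > best_auc:
            best_ratio, best_auc = r, auc
    return best_ratio, best_auc
```

As the acquired set approaches n, candidate_ratios would be narrowed around the ratios that won earlier epochs, reflecting the assumption of a problem-optimal ratio [43].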
It should be noted that many techniques employed for building classification models assume identical or similar training and test distributions. Violating this assumption may lead to biased predictions on test data, where classes preferentially represented in the training data are predicted more frequently. In particular, "increasing the prior probability of a class increases the posterior probability of the class, moving the classification boundary for that class so that more cases are classified into that class" [48, 49]. Thus, in settings where instances are deliberately selected in proportions different from those seen in the wild, posterior estimates should be corrected to account for the difference between the training-set and true class priors.
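One standard correction, sketched below, reweights each posterior by the ratio of the true to the training-set class prior and renormalizes; the chapter does not spell out this formula, so treat it as one common instantiation rather than the authors' prescribed method.

```python
import numpy as np

def correct_posteriors(proba, train_priors, target_priors):
    """Adjust posteriors from a model trained under shifted class priors:
    p'(c | x) is proportional to p(c | x) * target_prior(c) / train_prior(c)."""
    w = np.asarray(target_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    adjusted = proba * w             # broadcasts across rows of proba
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

For instance, a model trained on a deliberately balanced sample but deployed where only 1% of instances are positive would be corrected with correct_posteriors(proba, train_priors=[0.5, 0.5], target_priors=[0.99, 0.01]).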