it is possible to estimate this quantity directly. Let q enumerate over all possible
feature values that may be queried for labels. We can estimate the expected utility
of such a query by EU(q) = Σ_{k=1}^{K} P(q = c_k) U(q = c_k)/ω_q, where P(q = c_k)
is the probability of the instance or feature queried being associated with class c_k,
ω_q is the cost of query q, and U is some measure of the utility of q. 13 This results
in the decision-theoretic optimal policy, which is to ask for the feature labels that,
once incorporated into the data, will result in the highest expected increase in
classification performance [51, 63].
6.8.2.3 Active Dual Supervision

Active dual supervision (ADS) is concerned with situations where it is
possible to query an oracle for labels associated with both feature values and
examples. Even though such a paradigm is concerned with the simultaneous
acquisition of feature and example labels, the simplest approach is to treat each
acquisition problem separately and then mix the selections somehow. Active
interleaving performs a separate (un)certainty-based ordering on features and
on examples, and chooses selections from the top of each ordering according to
some predefined proportion. The different nature of feature and example
uncertainty values leads to incompatible quantities existing on different scales,
preventing a single, unified ordering. However, expected utility can be used to
acquisition. As mentioned earlier, we are estimating the utility of a certain feature
of example query q as: EU(q)
= k = 1 P(q
c k )/ω q . Using a single
utility function for both features and examples and incorporating label acquisition
costs, costs and benefits of the different types of acquisition can be optimized
directly [51].
=
c k )
U
(q
=
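The interleaving scheme described above can be sketched as follows. This is a simplified illustration rather than the procedure of [51]: the two query lists are assumed to be pre-sorted by decreasing uncertainty, and the mixing proportion is a hypothetical parameter.

```python
import random

def interleave(feature_queries, example_queries, feature_fraction,
               rng=random.Random(0)):
    """Active interleaving: repeatedly draw the next query from the top of
    the feature ordering with probability feature_fraction, otherwise from
    the top of the example ordering, until both lists are exhausted."""
    selections = []
    f, e = list(feature_queries), list(example_queries)
    while f or e:
        take_feature = f and (not e or rng.random() < feature_fraction)
        selections.append(f.pop(0) if take_feature else e.pop(0))
    return selections
```

With feature_fraction at the extremes the behavior is deterministic: 1.0 drains the feature ordering first, 0.0 drains the example ordering first; intermediate values mix the two streams in the stated proportion on average.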
6.9 CONCLUSION
This chapter presents a broad perspective on the relationship between AL (the
selective acquisition of labeled examples for training statistical models) and
imbalanced data classification tasks, where at least one of the classes in the train-
ing set is represented by far fewer instances than the other classes. Our
comprehensive analysis of this relationship identifies two common associations:
(i) the ability of AL to deal with the data imbalance problem that, when
manifested in a training set, typically degrades the generalization performance
of an induced model, and (ii) the impact class imbalance may have on the
ability of an otherwise reasonable AL scheme to select informative examples,
a phenomenon that becomes particularly acute as the imbalance tends toward
the extreme.
To mitigate the impact of class imbalance on the generalization performance
of a predictive model, in Sections 6.3 and 6.4 we present AL as an alternative
to more conventional resampling strategies. An AL strategy may select a dataset
13 For instance, cross-validated accuracy or log-gain may be used.