where examples are drawn from some available unlabeled pool in accordance
with some predefined class proportion. Section 6.8.2 then touches
on active feature labeling (AFL) and active dual supervision (ADS). These two
paradigms attempt to replace or supplement traditional supervised learning with
class-specific associations on certain feature values. While this set of techniques
requires specialized models, strong generalization performance can often be
achieved at a reasonable cost by leveraging explicit feature/class relationships.
This is often appealing in the active setting, where it is sometimes easier to
identify class-indicative feature values than to find quality training data for
labeling, particularly in the imbalanced setting.
6.8.1 Class-Conditional Example Acquisition
Imagine, as an alternative to the traditional AL problem setting, where an oracle
is queried to assign labels to specially selected unlabeled examples,
a setting where an oracle is charged with selecting exemplars from the underlying
problem space in accordance with some predefined class ratio. Consider, as a
motivating example, the problem of building predictive models based on data
collected through an “artificial nose” with the intent of “sniffing out” explosive or
hazardous chemical compounds [38-40]. In this setting, the reactivity of a large
number of chemicals is already known, representing label-conditioned pools of
available instances. However, producing these chemicals in a laboratory setting
and running the resulting compound through the artificial nose may be an expensive,
time-consuming process. While this problem may seem unusual, many
data acquisition tasks can be cast into a similar framework.
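The acquisition step described above can be illustrated with a minimal sketch. It assumes the label-conditioned pools are available as in-memory lists and that the requested class proportions sum to one. The function name `acquire_by_class_ratio`, its parameters, and the pool contents are hypothetical and chosen only for illustration; a real oracle (a laboratory or a human) would replace the random draw.

```python
import random


def acquire_by_class_ratio(pools, proportions, budget, seed=0):
    """Draw up to `budget` labeled examples from label-conditioned pools.

    pools:       dict mapping class label -> list of available instances
    proportions: dict mapping class label -> requested fraction (assumed to sum to 1)
    """
    rng = random.Random(seed)

    # Turn the requested fractions into integer per-class quotas.
    raw = {c: proportions[c] * budget for c in pools}
    quota = {c: int(raw[c]) for c in pools}

    # Hand any slots lost to rounding to the classes with the largest remainder.
    leftover = budget - sum(quota.values())
    by_remainder = sorted(pools, key=lambda c: raw[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    batch = []
    for c, k in quota.items():
        k = min(k, len(pools[c]))  # a pool cannot supply more than it holds
        batch.extend((x, c) for x in rng.sample(pools[c], k))
    return batch


# Hypothetical pools: 20 known-reactive and 200 known-inert compounds.
pools = {
    "reactive": [f"r{i}" for i in range(20)],
    "inert": [f"n{i}" for i in range(200)],
}
batch = acquire_by_class_ratio(pools, {"reactive": 0.5, "inert": 0.5}, budget=10)
```

Here the acquired batch is balanced even though the pools themselves are not, which is the point of specifying the class ratio up front rather than sampling from a single unlabeled pool.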
A much more general issue in selective data acquisition is the amount of
control ceded to the “oracle” doing the acquisition. The work discussed so far
assumes that an oracle will be queried for some specific value, and the oracle
simply returns that value. However, if the oracle is actually a person, he or she
may be able to apply considerable intelligence and other resources to “guide” the
selection. Such guidance is especially helpful in situations where some aspect of
the data is rare—where purely data-driven strategies are particularly challenged.
As discussed throughout this work, in many practical settings, one class
is quite rare. As an example motivating the application of class-conditional
example acquisition in practice, consider building from scratch a predictive model
that identifies web pages covering a particular topic of interest. While
large absolute numbers of such web pages may be present on the web, they may
be outnumbered by uninteresting pages by a million to one or worse (take, for
instance, the task of detecting and removing hate speech from the web [33]). As
discussed in Section 6.5, such extremely imbalanced problem settings present a
particularly insidious difficulty for traditional AL techniques. In a setting with
a 10,000:1 class ratio, a reasonably large labeling budget could be expended
without observing a single minority example. 9
9 Note that in practice, such extremely imbalanced problem settings may actually be quite common.
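To put a rough number on this claim, the following sketch computes the chance of seeing no minority example at all. It assumes examples are drawn uniformly at random and independently from a pool with a 10,000:1 class ratio. This is a simplification, since AL strategies do not sample at random, but it approximates the early stages when no informative model exists yet. The function name `prob_no_minority` is hypothetical.

```python
def prob_no_minority(budget, ratio=10_000):
    """Probability that `budget` independent uniform draws contain no minority example.

    Assumes a majority:minority ratio of `ratio`:1, so each draw is a
    minority example with probability 1 / (ratio + 1).
    """
    p_minority = 1.0 / (ratio + 1)
    return (1.0 - p_minority) ** budget


for budget in (100, 1_000, 10_000):
    print(budget, round(prob_no_minority(budget), 3))
```

Under these assumptions, a budget of 1,000 labels misses the minority class entirely about 90% of the time, and even 10,000 labels still comes up empty in roughly 37% of cases.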