3.3.1 Cost-Sensitive Learning
In cost-sensitive learning, instead of each instance being either correctly or
incorrectly classified, each class (or instance) is given a misclassification cost.
Thus, instead of trying to optimize the accuracy, the problem is to minimize
the total misclassification cost. Many traditional methods are easily extended
with this goal in mind. Support vector machines (SVMs), for instance, can be
easily modified to follow the cost-sensitive framework.
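To make the change of objective concrete, the short sketch below contrasts accuracy with total misclassification cost on a hypothetical two-class example; the cost matrix and labels are assumed values chosen only for illustration.

```python
import numpy as np

# Hypothetical cost matrix: cost[true_class, predicted_class].
# Misclassifying the rare class (class 1) is assumed 10x more costly.
cost = np.array([[0.0, 1.0],
                 [10.0, 0.0]])

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # one rare instance missed

accuracy = np.mean(y_true == y_pred)        # 0.90: looks good
total_cost = cost[y_true, y_pred].sum()     # 10.0: the costly error dominates

print(f"accuracy = {accuracy:.2f}, total cost = {total_cost:.1f}")
```

Although the classifier above is 90% accurate, almost all of the incurred cost comes from the single missed rare-class instance, which is exactly the situation cost-sensitive learning is designed to penalize.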
To see how this is done, consider the problem of learning an SVM. SVMs
attempt to learn a weight vector w that satisfies the following optimization prob-
lem:
\[
\min_{\mathbf{w},\,\boldsymbol{\xi},\,b} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{n}\xi_{i} \qquad (3.1)
\]
subject to (for all i ∈ {1, ..., n}):
\[
y_{i}(\mathbf{w}\cdot\mathbf{x}_{i} - b) \ge 1 - \xi_{i} \qquad (3.2)
\]
\[
\xi_{i} \ge 0 \qquad (3.3)
\]
where x_i denotes instance i, y_i denotes the class of instance i, ξ_i denotes the
"slack" for instance i (i.e., how badly misclassified, if at all, instance i is by
w), and C determines the trade-off between training error and model complexity.
In order to make SVMs cost sensitive, the objective function is modified
resulting in:
\[
\min_{\mathbf{w},\,\boldsymbol{\xi},\,b} \;\; \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{n} c_{i}\,\xi_{i}, \qquad (3.4)
\]
where c_i is the misclassification cost for instance i.
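In practice this change can often be realized without rewriting the solver. For example, scikit-learn's SVC accepts per-instance sample weights that scale each instance's slack penalty, which plays the role of C · c_i in Equation 3.4; the dataset and cost values below are assumed for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Imbalanced toy data: class 1 is the rare class.
rng = np.random.default_rng(0)
X_majority = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
X_minority = rng.normal(loc=2.5, scale=1.0, size=(5, 2))
X = np.vstack([X_majority, X_minority])
y = np.array([0] * 95 + [1] * 5)

# Assumed misclassification costs c_i: slack on rare instances is 10x more costly.
costs = np.where(y == 1, 10.0, 1.0)

# sample_weight multiplies each instance's contribution to the penalty term,
# giving an effective per-instance penalty of C * c_i as in Equation 3.4.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y, sample_weight=costs)

print(clf.predict(X[-5:]))   # predictions on the five rare instances
```

When costs are uniform within each class, the same effect can be obtained with the class_weight parameter instead of per-instance weights.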
In addition to modifying traditional learning algorithms, cost-sensitive ensemble
methods have also been developed. AdaCost [27] and cost-sensitive boosting
(CSB) [28] are two extensions of AdaBoost that incorporate the misclassification
cost of an instance when updating instance weights, rather than deriving the
weights from misclassification error alone as AdaBoost does.
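The exact weight updates of AdaCost and CSB differ in their details; the sketch below is only a simplified illustration of the shared idea, scaling a boosting-style exponential reweighting by an assumed per-instance cost, and does not reproduce either published algorithm.

```python
import numpy as np

def cost_aware_reweight(weights, y_true, y_pred, costs, alpha):
    """Illustrative update only (NOT the exact AdaCost or CSB rule):
    misclassified instances are up-weighted in proportion to both the
    round's alpha and their misclassification cost."""
    misclassified = (y_true != y_pred).astype(float)
    new_weights = weights * np.exp(alpha * costs * misclassified)
    return new_weights / new_weights.sum()   # renormalize to a distribution

# Example round: one costly (rare-class) instance is misclassified.
weights = np.full(4, 0.25)
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1])
costs  = np.array([1.0, 1.0, 5.0, 5.0])     # assumed costs
print(cost_aware_reweight(weights, y_true, y_pred, costs, alpha=0.5))
```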
3.3.2 Skew-Insensitive Learners
While cost-sensitive learning is a popular way of extending classifiers for use
on the class imbalance problem, some classifiers allow for easier extensions that
directly combat the problem of class imbalance.
Naïve Bayes, for instance, is trivially skew insensitive. This is due to the fact
that it makes predictions p(y | x_i) by first computing p(x_i | y) and p(y) from the
training data. Bayes' rule is then applied to obtain a classification for an instance
x_i as: p(y | x_i) ∝ p(x_i | y) · p(y). Since p(x_i | y) is difficult to calculate in general,
a simplifying assumption is made (the "naïve" in "naïve Bayes"), namely, that each
feature of x_i is conditionally independent of the others given the class y.
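A minimal sketch of that factorization is given below, using Gaussian class-conditional densities and an assumed toy dataset; the per-feature densities are multiplied together precisely because of the conditional-independence assumption, and the class priors p(y) are estimated directly from the (possibly imbalanced) training data.

```python
import numpy as np
from scipy.stats import norm

# Toy imbalanced training data: two features, class 1 is rare.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [0.9, 1.9],
              [3.0, 0.5], [3.2, 0.4]])
y = np.array([0, 0, 0, 0, 1, 1])

def predict_naive_bayes(x_new):
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                       # p(y), from class frequencies
        # "Naive" assumption: features are conditionally independent given y,
        # so p(x | y) factorizes into a product of per-feature densities.
        likelihood = np.prod(norm.pdf(x_new, loc=Xc.mean(axis=0),
                                      scale=Xc.std(axis=0) + 1e-9))
        scores[c] = prior * likelihood                 # proportional to p(y | x)
    return max(scores, key=scores.get)

print(predict_naive_bayes(np.array([3.1, 0.45])))      # expected: class 1
```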