NONSTATIONARY STREAM DATA LEARNING WITH IMBALANCED CLASS DISTRIBUTION - Imbalanced Learning: Foundations, Algorithms, and Applications - page 164


OA is usually adopted in the traditional learning scenario, that is, static datasets with balanced class distribution, to evaluate the performance of algorithms. However, when the context changes to imbalanced learning, it is wise to apply other metrics for such evaluation [19], among which the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUROC) are the most strongly recommended [36].

On the basis of the confusion matrix as defined in Figure 7.4, one can calculate

the TP rate and FP rate as follows:

$$\text{TP rate} = \frac{TP}{P_R} = \frac{TP}{TP + FN} \tag{7.32}$$

$$\text{FP rate} = \frac{FP}{N_R} = \frac{FP}{FP + TN} \tag{7.33}$$
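As a quick numerical check of Equations (7.32) and (7.33), the following minimal sketch computes both rates from the four confusion-matrix counts. The function name `tp_fp_rates` and the example counts are illustrative assumptions, not taken from the text.

```python
def tp_fp_rates(tp, fn, fp, tn):
    """Return (TP rate, FP rate) following Eqs. (7.32) and (7.33)."""
    real_positives = tp + fn  # P_R: all examples that are truly positive
    real_negatives = fp + tn  # N_R: all examples that are truly negative
    if real_positives == 0 or real_negatives == 0:
        raise ValueError("both classes must contain at least one example")
    return tp / real_positives, fp / real_negatives


# Hypothetical counts for an imbalanced problem (10 positives, 90 negatives).
tp_rate, fp_rate = tp_fp_rates(tp=8, fn=2, fp=5, tn=85)
print(f"TP rate = {tp_rate:.3f}, FP rate = {fp_rate:.3f}")
```

Each rate is normalized by the size of its own class, so neither changes when the majority class grows while the classifier's per-class behavior stays the same.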

ROC space is established by plotting TP rate against FP rate. Generally speaking, hard-type classifiers (those that output only discrete class labels) correspond to single points (FP rate, TP rate) in ROC space. Soft-type classifiers (those that output a likelihood that an instance belongs to each class), on the other hand, correspond to curves in ROC space. Such curves are formed by adjusting the decision threshold to generate a series of points in ROC space. For example, if the likelihoods of an unlabeled instance $x_k$ belonging to the minority class and the majority class are 0.3 and 0.7, respectively, the natural decision threshold $d = 0.5$ would classify $x_k$ as a majority class example because $0.3 < d$. However, $d$ could also be set to other values, for example, $d = 0.2$. In this case, $x_k$ would be classified as a minority class example because $0.3 > d$. By tuning $d$ from 0 to 1 with a small step, for example, 0.01, a series of (FP rate, TP rate) points can be created in ROC space. To assess the performance of different classifiers in this case, one generally uses AUROC as an evaluation criterion; it is defined as the area between the ROC curve and the horizontal axis (the axis representing FP rate).
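The threshold sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `roc_points` and `auroc`, the toy scores, and the use of the trapezoidal rule for the area are all choices made here. The minority class is treated as the positive class, and an instance is labeled minority when its minority likelihood exceeds $d$.

```python
import numpy as np


def roc_points(minority_scores, is_minority, step=0.01):
    """Sweep the threshold d from 0 to 1 and collect (FP rate, TP rate)."""
    scores = np.asarray(minority_scores, dtype=float)
    labels = np.asarray(is_minority, dtype=bool)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    fp_rates, tp_rates = [], []
    for d in np.arange(0.0, 1.0 + step / 2, step):
        predicted_minority = scores > d           # decision rule at threshold d
        tp = np.sum(predicted_minority & labels)  # minority predicted as minority
        fp = np.sum(predicted_minority & ~labels) # majority predicted as minority
        tp_rates.append(tp / n_pos)
        fp_rates.append(fp / n_neg)
    return np.array(fp_rates), np.array(tp_rates)


def auroc(fp_rates, tp_rates):
    """Area between the ROC curve and the FP-rate axis (trapezoidal rule)."""
    # Anchor the curve at (0, 0) and (1, 1), then order points left to right.
    x = np.concatenate(([0.0], fp_rates, [1.0]))
    y = np.concatenate(([0.0], tp_rates, [1.0]))
    order = np.lexsort((y, x))  # sort by FP rate, ties broken by TP rate
    x, y = x[order], y[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


# Toy data: minority-class likelihoods for 3 minority and 5 majority instances.
scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
fp_rates, tp_rates = roc_points(scores, labels)
area = auroc(fp_rates, tp_rates)
print(f"AUROC = {area:.4f}")
```

In this toy data the instance with likelihood 0.3 is labeled majority at $d = 0.5$ and minority at $d = 0.2$, matching the worked example in the text.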

In order to reflect the ROC curve characteristics for all random runs, the vertical averaging approach [36] is adopted to plot the averaged ROC curves. Implementation of the vertical averaging method is illustrated in Figure 7.5. Assume one would like to average two ROC curves, $l_1$ and $l_2$, both of which are formed by a series of points in ROC space. The first step is to evenly divide the range of FP rate into a set of intervals. Then, at each interval, find the corresponding TP rate values of each ROC curve and average them. In Figure 7.5, $X_1$ and $Y_1$ are the points from $l_1$ and $l_2$ corresponding to the interval FP rate 1. By averaging their TP rate values, the corresponding ROC point $Z_1$ on the averaged ROC curve is obtained. However, some ROC curves do not have corresponding points on certain intervals. In this case, one can use linear interpolation to obtain the averaged ROC points. For instance, in Figure 7.5, the point $X$ (corresponding to FP rate 2) is calculated on the basis of the linear interpolation of the two neighboring points $X_2$ and $X_3$. Once $X$ is obtained,
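The vertical averaging steps described so far (an even FP-rate grid, linear interpolation where a curve has no point, then averaging TP rates) can be sketched as below. This is an illustrative sketch only: the function name `vertical_average`, the two toy curves, and the choice to keep the largest TP rate when a curve has several points at the same FP rate are assumptions made here, not details given in the text.

```python
import numpy as np


def vertical_average(curves, n_intervals=10):
    """Average ROC curves at evenly spaced FP-rate values.

    curves: iterable of (fp_rates, tp_rates) pairs, one pair per ROC curve.
    Returns the FP-rate grid and the averaged TP rate at each grid value.
    """
    grid = np.linspace(0.0, 1.0, n_intervals + 1)
    tp_on_grid = []
    for fp, tp in curves:
        fp = np.asarray(fp, dtype=float)
        tp = np.asarray(tp, dtype=float)
        order = np.argsort(fp, kind="stable")
        fp_sorted, tp_sorted = fp[order], tp[order]
        # Assumption: where FP rates repeat, keep the maximum TP rate.
        unique_fp, first = np.unique(fp_sorted, return_index=True)
        max_tp = np.maximum.reduceat(tp_sorted, first)
        # Linear interpolation supplies TP rates at grid values the curve lacks.
        tp_on_grid.append(np.interp(grid, unique_fp, max_tp))
    return grid, np.mean(tp_on_grid, axis=0)


# Two hypothetical ROC curves given as (FP rates, TP rates).
l1 = ([0.0, 0.2, 0.6, 1.0], [0.0, 0.6, 0.8, 1.0])
l2 = ([0.0, 0.4, 1.0], [0.0, 0.4, 1.0])
grid, averaged_tp = vertical_average([l1, l2], n_intervals=5)
for x, y in zip(grid, averaged_tp):
    print(f"FP rate {x:.1f} -> averaged TP rate {y:.2f}")
```

Here `l2` has no point at FP rate 0.2, so its TP rate there is interpolated between its neighbors before averaging, which mirrors the role of the interpolated point in Figure 7.5.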
