NONSTATIONARY STREAM DATA LEARNING WITH IMBALANCED CLASS DISTRIBUTION - Imbalanced Learning: Foundations, Algorithms, and Applications


block consists of 10 data chunks, each of which has 1000 examples as the training dataset and 200 instances as the testing dataset. Examples whose two features sum to more than the threshold belong to the majority class; all others belong to the minority class. The number of generated minority class examples is restricted to 1/20 of the total number of examples in the corresponding data chunk. In other words, the imbalance ratio is set to 0.05 in our simulation. To introduce some uncertainty/noise into the dataset, 1% of the examples in each training dataset are randomly sampled and their class labels are reversed. Because most of the roughly 10 flipped examples come from the majority class, approximately 1/6 of the examples labeled as minority end up erroneously labeled, which challenges all comparative algorithms learning from this dataset to handle noise.
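The chunk construction and the 1/6 figure can be checked with a minimal sketch. It is only an illustration under stated assumptions, not the generator used in the experiments: the feature range [0, 10], the threshold value 8.0, the rejection-sampling strategy, and all function names are hypothetical choices made here.

```python
import random


def make_chunk(n_examples, threshold, minority_ratio=0.05, rng=None):
    """Draw one chunk with a fixed share of minority examples.

    Label 1 (minority) if f1 + f2 <= threshold, label 0 (majority) otherwise.
    The feature range [0, 10] is an assumption, not taken from the text.
    """
    rng = rng or random.Random()
    need_minority = int(round(n_examples * minority_ratio))
    need_majority = n_examples - need_minority
    chunk = []
    # Rejection sampling: keep drawing until each class has met its quota.
    while need_minority > 0 or need_majority > 0:
        x = (rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0))
        label = 0 if x[0] + x[1] > threshold else 1
        if label == 1 and need_minority > 0:
            chunk.append((x, label))
            need_minority -= 1
        elif label == 0 and need_majority > 0:
            chunk.append((x, label))
            need_majority -= 1
    rng.shuffle(chunk)
    return chunk


def flip_labels(chunk, noise_rate=0.01, rng=None):
    """Reverse the labels of a random noise_rate share of the chunk."""
    rng = rng or random.Random()
    n_flip = int(round(len(chunk) * noise_rate))
    flipped = set(rng.sample(range(len(chunk)), n_flip))
    noisy = [(x, 1 - y if i in flipped else y) for i, (x, y) in enumerate(chunk)]
    return noisy, flipped


def expected_noisy_minority_fraction(n_examples=1000, minority_ratio=0.05,
                                     noise_rate=0.01):
    """Expected share of minority-labeled examples whose label is wrong."""
    n_flip = n_examples * noise_rate
    into_minority = n_flip * (1 - minority_ratio)   # majority relabeled as minority
    out_of_minority = n_flip * minority_ratio       # minority relabeled as majority
    labeled_minority = n_examples * minority_ratio - out_of_minority + into_minority
    return into_minority / labeled_minority
```

With the defaults, the expected share is 9.5/59, or about 0.16, which is where the "approximately 1/6" figure comes from: on average 9.5 of the 10 flipped examples are majority examples that become minority-labeled, while only 0.5 of the 50 true minority examples leave the minority class.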

The simulation results for the SEA dataset are averaged over 10 random runs. In each run, the dataset is regenerated from scratch. To track the performance of the algorithms over the whole learning life, “observation points” are placed at chunks 5, 10, 15, 20, 25, 30, 35, and 40.
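The averaging protocol can be sketched as follows. This is a hedged illustration only: `run_once` is a hypothetical callback that regenerates the data with a given seed, trains over the stream, and returns one metric value per chunk index; neither the name nor its signature comes from the text.

```python
from statistics import mean

# Chunks at which performance is recorded, as listed in the text.
OBSERVATION_POINTS = (5, 10, 15, 20, 25, 30, 35, 40)


def average_over_runs(run_once, n_runs=10, points=OBSERVATION_POINTS):
    """Average a per-chunk metric over independent runs.

    run_once(seed) is assumed to return a mapping {chunk_index: metric}.
    """
    per_point = {p: [] for p in points}
    for seed in range(n_runs):
        scores = run_once(seed)  # fresh data and fresh models for every run
        for p in points:
            per_point[p].append(scores[p])
    return {p: mean(values) for p, values in per_point.items()}
```

Using a distinct seed per run keeps the runs independent while still making each one reproducible.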

The trend lines of the averaged prediction accuracy over the observation points are plotted in Figure 7.6a. One can conclude from this figure that (i) REA provides higher prediction accuracy on testing data over time than UB, which is consistent with the theoretical conclusion made in Section 7.3.3; (ii) REA does not outperform the other comparative algorithms in terms of OA over time. In fact, it is the baseline (“Normal”) that provides the most competitive results in terms of OA on testing data most of the time. However, as discussed previously, OA is not of primary importance in the imbalanced learning scenario. Metrics such as ROC/AUROC are what indicate how well an algorithm performs on imbalanced datasets.
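A small illustration of why OA can mislead here: a degenerate classifier that always predicts the majority class scores a high OA but an uninformative AUROC. The sketch assumes a 200-instance testing set with the same 0.05 imbalance ratio (10 minority examples); the function names are hypothetical, and the AUROC is computed with the pairwise ranking definition rather than any particular library routine.

```python
def overall_accuracy(y_true, y_pred):
    """Share of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Probability that a random minority example outscores a random
    majority example, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed testing set: 10 minority (label 1) and 190 majority (label 0).
y_true = [1] * 10 + [0] * 190

# Degenerate classifier: constant score, always predicts the majority class.
scores = [0.0] * len(y_true)
y_pred = [0] * len(y_true)

oa = overall_accuracy(y_true, y_pred)   # 0.95
auc = auroc(y_true, scores)             # 0.5
```

The classifier never identifies a single minority example, yet its OA is 0.95; its AUROC of 0.5 is what a random ranking would achieve.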

The AUROC values of the comparative algorithms on the observation points are given in Figure 7.6b, complemented by the corresponding ROC curves on data chunks 10 (Fig. 7.7a), 20 (Fig. 7.7b), 30 (Fig. 7.7c), and 40 (Fig. 7.7d), respectively, as well as the corresponding numeric AUROC values on these data chunks given in Table 7.1. On this metric, REA outperforms the other algorithms, and SERA is generally better than the baseline. Beyond that, the comparison among the remaining algorithms and the baseline is inconclusive.

7.4.2.2 ELEC Dataset The electricity market dataset (ELEC dataset) [37] is used as a real-world dataset to validate the effectiveness of the proposed algorithm in real-world applications. The data were collected from the Australian New South Wales (NSW) Electricity Market to reflect the electricity price fluctuation (Up/Down) affected by demand and supply of the market. Since the influence of the market on electricity price evolves unpredictably in the real world, the concrete form of the concept drifts embedded in the dataset is unknown, in contrast to synthetic datasets, where the concept drift is set up by hand.

The original dataset contains 4531 examples dated from May 1996 to December 1998. We only retain examples after May 11, 1997, for this simulation because
