block consists of 10 data chunks, each of which has 1000 examples as the training
dataset and 200 instances as the testing dataset. Examples with the sum of the
two features greater than the threshold belong to the majority class, and those
otherwise reside in the minority class. The number of generated minority class
examples is restricted to 1/20 of the total number of examples in the corresponding
data chunk. In other words, the imbalance ratio is set to 0.05 in our simulation.
To introduce noise into the dataset, 1% of the examples in each training dataset
are randomly sampled and their class labels are reversed. Because nearly all of
the sampled examples come from the majority class and thus acquire the minority
label, approximately 1/6 of the examples labeled as minority are erroneously
labeled, which makes noise handling a challenge for all the compared algorithms
learning from this dataset.
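The chunk construction described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name `make_sea_chunk`, the three features drawn uniformly from [0, 10] (the usual SEA setup), the example threshold of 8, and the label coding (0 for majority, 1 for minority) are all assumptions made here. With 1000 examples and 10 flipped labels, one expects about 9.5 flips from majority to minority and 0.5 in the other direction, so roughly 9.5 of the 59 examples labeled as minority are wrong, which is close to 1/6.

```python
import numpy as np

def make_sea_chunk(threshold, n=1000, minority_frac=0.05, noise_frac=0.01, rng=None):
    """Generate one imbalanced, noisy SEA-style training chunk (illustrative sketch).

    Assumed conventions: three features uniform on [0, 10], only the first two
    determine the class; label 0 = majority, label 1 = minority.
    """
    rng = np.random.default_rng(rng)
    n_min = int(round(n * minority_frac))
    n_maj = n - n_min

    def sample(is_majority, count):
        # Rejection sampling: keep points on the requested side of the threshold.
        kept, total = [], 0
        while total < count:
            x = rng.uniform(0.0, 10.0, size=(4 * count, 3))
            mask = (x[:, 0] + x[:, 1] > threshold) == is_majority
            kept.append(x[mask])
            total += int(mask.sum())
        return np.concatenate(kept)[:count]

    X = np.vstack([sample(True, n_maj), sample(False, n_min)])
    y_clean = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])

    # Reverse the class labels of a randomly sampled 1% of the examples.
    flip = rng.choice(n, size=int(round(n * noise_frac)), replace=False)
    y = y_clean.copy()
    y[flip] = 1 - y[flip]

    perm = rng.permutation(n)
    return X[perm], y[perm], y_clean[perm]

# Fraction of examples labeled as minority whose label is wrong, for one chunk.
X, y, y_clean = make_sea_chunk(threshold=8.0, rng=0)
labeled_minority = y == 1
wrongly_minority = labeled_minority & (y_clean == 0)
print(wrongly_minority.sum() / labeled_minority.sum())
```

The printed fraction varies from chunk to chunk because the number of flips that happen to hit minority examples is random.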
The simulation results for the SEA dataset are averaged over 10 random runs.
In each random run, the dataset is regenerated from scratch. To track the
performance of the algorithms throughout the learning process, “observation
points” are placed on chunks 5, 10, 15, 20, 25, 30, 35, and 40.
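The evaluation protocol above (10 regenerated runs, metrics read off at fixed observation points) can be sketched as below. The helper `average_over_runs` and the callable `run_once` are hypothetical names introduced for illustration; `run_once(seed)` is assumed to regenerate the stream, train incrementally, and return a mapping from chunk index to a metric value.

```python
import numpy as np

# Chunks at which performance is recorded, as listed in the text.
OBSERVATION_POINTS = (5, 10, 15, 20, 25, 30, 35, 40)

def average_over_runs(run_once, n_runs=10, points=OBSERVATION_POINTS):
    """Average a per-chunk metric over independent runs at the observation points.

    run_once(seed) is a hypothetical callable returning {chunk_index: metric}.
    """
    rows = []
    for seed in range(n_runs):
        metrics = run_once(seed)  # one full pass over a freshly generated stream
        rows.append([metrics[p] for p in points])
    means = np.mean(rows, axis=0)
    return {p: float(m) for p, m in zip(points, means)}
```

Using a different seed per run mirrors the statement that the dataset is generated anew each time, so the averaged curve reflects variation in the data as well as in the learner.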
The trend lines of the averaged prediction accuracy over the observation
points are plotted in Figure 7.6a. One can conclude from this figure that (i) REA
provides higher prediction accuracy on testing data over time than UB, which
is consistent with the theoretical conclusion made in Section 7.3.3; and (ii) REA
does not outperform the other comparative algorithms in terms of OA over time.
In fact, it is the baseline (“Normal”) that provides the most competitive results in
terms of OA on testing data most of the time. However, as discussed previously,
OA is not of primary importance in the imbalanced learning scenario. It is
metrics such as ROC/AUROC that determine how well an algorithm performs
on imbalanced datasets.
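A small numeric illustration of why OA can mislead under class imbalance is given below. It is a sketch under stated assumptions: the 200-instance test chunk with 10 minority examples, the two sets of scores, and the convention that the minority class is the positive class (label 1) are all made up for illustration, and AUROC is computed directly as the probability that a randomly chosen minority example is scored above a randomly chosen majority example, with ties counted as one half.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores (ties = 1/2)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Hypothetical test chunk: 190 majority (0) and 10 minority (1) examples.
y_true = np.array([0] * 190 + [1] * 10)

# Classifier A ignores the minority class: constant score, always predicts 0.
scores_a = np.zeros(200)

# Classifier B ranks most minority examples above most majority examples.
scores_b = np.array([0.3] * 180 + [0.7] * 10 + [0.6] * 8 + [0.4] * 2)

for name, s in (("A", scores_a), ("B", scores_b)):
    y_pred = (s > 0.5).astype(int)
    print(name, overall_accuracy(y_true, y_pred), auroc(y_true, s))
```

Classifier A reaches an OA of 0.95 while its AUROC is only 0.5, whereas classifier B has a slightly lower OA of 0.94 but an AUROC of about 0.947, which is the kind of difference OA alone would hide.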
The AUROC values of the comparative algorithms on the observation points
are given in Figure 7.6b, complemented by the corresponding ROC curves on
data chunks 10 (Fig. 7.7a), 20 (Fig. 7.7b), 30 (Fig. 7.7c), and 40 (Fig. 7.7d),
respectively, as well as the corresponding numeric AUROC values on these data
chunks given in Table 7.1. In this metric, REA gives superior performance over
the other algorithms, and SERA is generally better than the baseline. Beyond
that, the comparison among the rest of the algorithms and the baseline is
inconclusive.
7.4.2.2 ELEC Dataset The electricity market dataset (ELEC dataset) [37] is
used to validate the effectiveness of the proposed algorithm in real-world
applications. The data were collected from the Australian New
South Wales (NSW) Electricity Market to reflect the electricity price fluctuation
(Up/Down) affected by demand and supply of the market. Since the influence of
the market on electricity price evolves unpredictably in the real world, the
concrete representation of the concept drifts embedded in the dataset is
inaccessible, unlike in synthetic datasets, where the concept drift is set up
by hand.
The original dataset contains 45,312 examples dated from May 1996 to December
1998. We only retain examples after May 11, 1997, for this simulation because