Algorithm 7.1 Naive implementation of the over-sampling method
Inputs:
1: timestamp: t
2: current training data chunk: S(t) = {(x_1, y_1), ..., (x_m, y_m)} /* x_i ∈ X, y_i ∈ Y */
3: current data set under evaluation: T(t) = {x_1, ..., x_n} /* x_j is the j-th instance. Class label of instances in T(t) is unknown. */
4: base classifier: L /* e.g., CART, MLP, etc. */
Procedure:
5: for t = 1, . . . do
6:   S(t) → {P(t), N(t)} /* P(t), N(t) are the minority and majority class sets for S(t), respectively. */
7:   M(t) ← SMOTE(P(t)) /* class labels of instances within M(t) are all the minority class label. */
8:   h(t)_final ← L({S(t), M(t)}) /* h(t)_final : X → Y */
9:   return hypothesis h(t)_final for predicting the class label of any instance x in T(t)
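To make the per-chunk procedure concrete, the following is a minimal sketch in Python. It assumes imbalanced-learn's SMOTE for step 7 and scikit-learn's DecisionTreeClassifier standing in for the base classifier L; the function name train_on_chunk and its parameters are illustrative and not part of the original algorithm.

from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier  # stands in for the base classifier L

def train_on_chunk(X_chunk, y_chunk, k_neighbors=5, random_state=0):
    # Steps 6-7: over-sample the minority class P(t) of the chunk S(t) with SMOTE.
    # fit_resample returns the original chunk plus the synthetic minority
    # instances, i.e. the augmented chunk {S(t), M(t)}.
    smote = SMOTE(k_neighbors=k_neighbors, random_state=random_state)
    X_aug, y_aug = smote.fit_resample(X_chunk, y_chunk)
    # Step 8: apply the base classifier L to the augmented chunk.
    h_t = DecisionTreeClassifier(random_state=random_state)
    h_t.fit(X_aug, y_aug)
    # Step 9: h_t is used to label the instances of T(t).
    return h_t

# Example use on one chunk:
# h_t = train_on_chunk(X_S_t, y_S_t)
# y_pred = h_t.predict(X_T_t)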
Each base hypothesis h_t is built on the augmented data chunk {S(t), M(t)}. The learning performance of h_t is examined by evaluating its prediction error rate ε on the original S(t). If ε > 0.5, that is, worse than a random guess, h_t is abandoned and the same procedure is applied again until a qualified hypothesis is obtained. This, however, exposes a potential problem of the algorithm: steps 9-12 might be repeated many times before a hypothesis with acceptable performance is produced, which prevents the algorithm from being applied to high-speed data streams. A similar check also applies to all base hypotheses created on previous data chunks, that is, {h_1, h_2, ..., h_{t-1}}: those that fail the test, that is, ε > 0.5 on the current data chunk, have their weights set to 0. In this way, it is guaranteed that only base hypotheses that achieve satisfactory performance on the current training data chunk are used to constitute the ensemble classifier h(t)_final for predicting unlabeled instances in the testing dataset T(t).
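The error-rate check on previous hypotheses can be sketched as follows, assuming scikit-learn-style classifiers with a predict method and binary labels in {0, 1}; the function names and the weighted-vote combination are illustrative assumptions, since the text does not fix a particular voting rule here.

import numpy as np

def prune_weights(hypotheses, weights, X_S_t, y_S_t):
    # Zero the weight of any hypothesis whose error rate epsilon on the
    # current chunk S(t) exceeds 0.5, i.e. worse than a random guess.
    pruned = []
    for h, w in zip(hypotheses, weights):
        eps = np.mean(h.predict(X_S_t) != y_S_t)
        pruned.append(0.0 if eps > 0.5 else w)
    return pruned

def ensemble_predict(hypotheses, weights, X):
    # Weighted vote of the retained hypotheses over binary labels {0, 1}.
    total = sum(weights)
    if total == 0.0:
        return np.zeros(len(X), dtype=int)
    votes = sum(w * h.predict(X) for w, h in zip(weights, hypotheses))
    return (votes >= 0.5 * total).astype(int)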
7.3.2 Take-In-All Accommodation Algorithm
Instead of creating synthetic minority class instances to balance the training data chunk, this kind of algorithm keeps all previous minority class examples over time and pushes them into the current training data chunk to compensate for the imbalanced class ratio. An implementation of this algorithm [26] is shown in Algorithm 7.3.
All minority class examples are kept inside the data queue Q. Upon arrival of a new data chunk, all previous minority class examples are pushed into it to compensate for its imbalanced class distribution. Then an ensemble of classifiers is built on the resulting data.
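A minimal sketch of this accommodation step is given below, assuming numpy feature arrays and a single known minority class label; the class and method names (TakeInAllAccommodator, accommodate) are illustrative and not taken from [26].

import numpy as np

class TakeInAllAccommodator:
    # Keeps every minority-class example seen so far in the queue Q and
    # pushes the stored examples into each newly arrived training chunk.
    def __init__(self, minority_label=1):
        self.minority_label = minority_label
        self.Q_X = None   # accumulated minority-class features (queue Q)
        self.Q_y = None   # accumulated minority-class labels

    def accommodate(self, X_chunk, y_chunk):
        # Push all previously stored minority examples into the current chunk.
        if self.Q_X is not None:
            X_aug = np.vstack([self.Q_X, X_chunk])
            y_aug = np.concatenate([self.Q_y, y_chunk])
        else:
            X_aug, y_aug = X_chunk, y_chunk
        # Store the minority examples of the current chunk for future chunks.
        mask = y_chunk == self.minority_label
        new_X, new_y = X_chunk[mask], y_chunk[mask]
        if self.Q_X is None:
            self.Q_X, self.Q_y = new_X, new_y
        else:
            self.Q_X = np.vstack([self.Q_X, new_X])
            self.Q_y = np.concatenate([self.Q_y, new_y])
        return X_aug, y_aug

The augmented chunk returned by accommodate would then be used, under these assumptions, to train the classifiers of the ensemble at timestamp t.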