Algorithm 7.1 Naive implementation of the over-sampling method
Inputs:
1: timestamp: t
2: current training data chunk: S(t) = {(x_1, y_1), ..., (x_m, y_m)} /* x_i ∈ X, y_i ∈ Y */
3: current data set under evaluation: T(t) = {x_1, ..., x_n} /* x_j is the j-th instance. Class label of instances in T(t) is unknown. */
4: base classifier: L /* e.g., CART, MLP, etc. */
Procedure:
5: for t = 1, . . . do
6:   S(t) → {P(t), N(t)} /* P(t), N(t) are the minority and majority class sets for S(t), respectively. */
7:   M(t) ← SMOTE(P(t)) /* class labels of instances within M(t) are all the minority class label. */
8:   h(t)_final ← L({S(t), M(t)}) /* h(t)_final : X → Y */
9:   return hypothesis h(t)_final for predicting the class label of any instance x in T(t)
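To make the per-chunk procedure concrete, the following is a minimal sketch in Python. It assumes imbalanced-learn's SMOTE for step 7 and scikit-learn's DecisionTreeClassifier standing in for the base classifier L; the function name train_on_chunk and its parameters are illustrative and not part of the original algorithm.

from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier  # stands in for the base classifier L

def train_on_chunk(X_chunk, y_chunk, k_neighbors=5, random_state=0):
    # Steps 6-7: over-sample the minority class P(t) of the chunk S(t) with SMOTE.
    # fit_resample returns the original chunk plus the synthetic minority
    # instances, i.e. the augmented chunk {S(t), M(t)}.
    smote = SMOTE(k_neighbors=k_neighbors, random_state=random_state)
    X_aug, y_aug = smote.fit_resample(X_chunk, y_chunk)
    # Step 8: apply the base classifier L to the augmented chunk.
    h_t = DecisionTreeClassifier(random_state=random_state)
    h_t.fit(X_aug, y_aug)
    # Step 9: h_t is used to label the instances of T(t).
    return h_t

# Example use on one chunk:
# h_t = train_on_chunk(X_S_t, y_S_t)
# y_pred = h_t.predict(X_T_t)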
Each base hypothesis h_t is built on the augmented data chunk {S(t), M(t)}. The learning performance of h_t is examined by evaluating its prediction error rate ε on the original S(t). If ε > 0.5, that is, worse than a random guess, h_t is abandoned and the same procedure is applied again until a qualified hypothesis is obtained. This, however, exposes a potential problem of the algorithm: steps 9-12 might be repeated many times before a hypothesis with acceptable performance is produced, which prevents the algorithm from being applied to high-speed data streams. A similar check also applies to all base hypotheses created on previous data chunks, that is, {h_1, h_2, ..., h_{t-1}}: those that fail the test, that is, ε > 0.5 on the current data chunk, have their weights set to 0. In this way, it is guaranteed that only base hypotheses that achieve satisfactory performance on the current training data chunk are used to constitute the ensemble classifier h(t)_final for predicting unlabeled instances in the testing dataset T(t).
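The error-rate check on previous hypotheses can be sketched as follows, assuming scikit-learn-style classifiers with a predict method and binary labels in {0, 1}; the function names and the weighted-vote combination are illustrative assumptions, since the text does not fix a particular voting rule here.

import numpy as np

def prune_weights(hypotheses, weights, X_S_t, y_S_t):
    # Zero the weight of any hypothesis whose error rate epsilon on the
    # current chunk S(t) exceeds 0.5, i.e. worse than a random guess.
    pruned = []
    for h, w in zip(hypotheses, weights):
        eps = np.mean(h.predict(X_S_t) != y_S_t)
        pruned.append(0.0 if eps > 0.5 else w)
    return pruned

def ensemble_predict(hypotheses, weights, X):
    # Weighted vote of the retained hypotheses over binary labels {0, 1}.
    total = sum(weights)
    if total == 0.0:
        return np.zeros(len(X), dtype=int)
    votes = sum(w * h.predict(X) for w, h in zip(weights, hypotheses))
    return (votes >= 0.5 * total).astype(int)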
7.3.2 Take-In-All Accommodation Algorithm
Instead of creating synthetic minority class instances to balance the training data chunk, this kind of algorithm keeps all previous minority class examples over time and pushes them into the current training data chunk to compensate for the imbalanced class ratio. An implementation of this algorithm [26] is shown in Algorithm 7.3.
All minority class examples are kept inside the data queue Q. Upon arrival of a new data chunk, all previous minority class examples are pushed into it to compensate for its imbalanced class distribution. Then an ensemble of classifiers is built on the resulting data.
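A minimal sketch of this accommodation step is given below, assuming numpy feature arrays and a single known minority class label; the class and method names (TakeInAllAccommodator, accommodate) are illustrative and not taken from [26].

import numpy as np

class TakeInAllAccommodator:
    # Keeps every minority-class example seen so far in the queue Q and
    # pushes the stored examples into each newly arrived training chunk.
    def __init__(self, minority_label=1):
        self.minority_label = minority_label
        self.Q_X = None   # accumulated minority-class features (queue Q)
        self.Q_y = None   # accumulated minority-class labels

    def accommodate(self, X_chunk, y_chunk):
        # Push all previously stored minority examples into the current chunk.
        if self.Q_X is not None:
            X_aug = np.vstack([self.Q_X, X_chunk])
            y_aug = np.concatenate([self.Q_y, y_chunk])
        else:
            X_aug, y_aug = X_chunk, y_chunk
        # Store the minority examples of the current chunk for future chunks.
        mask = y_chunk == self.minority_label
        new_X, new_y = X_chunk[mask], y_chunk[mask]
        if self.Q_X is None:
            self.Q_X, self.Q_y = new_X, new_y
        else:
            self.Q_X = np.vstack([self.Q_X, new_X])
            self.Q_y = np.concatenate([self.Q_y, new_y])
        return X_aug, y_aug

The augmented chunk returned by accommodate would then be used, under these assumptions, to train the classifiers of the ensemble at timestamp t.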