the learning performance on minority class examples in the current training data
chunk. This gives rise to the method of selectively accommodating into the current
training data chunk those previous minority class examples whose target concept is
most similar to that of the current minority class.
A direct way to compare the similarity between a previous minority
class example and the current minority class is to calculate the Mahalanobis
distance between them [27, 28]. It differs from Euclidean distance in that
it takes into account the correlations of the dataset and is scale-invariant.
The Mahalanobis distance from a set of n-variate instances with a mean value
$\mu = [\mu_1, \ldots, \mu_n]^T$ and covariance matrix $\Sigma$ to an arbitrary instance
$x = [x_1, \ldots, x_n]^T$ is defined as [30]:

$$D_M(x) = \sqrt{(x - \mu)^T \, \Sigma^{-1} \, (x - \mu)} \qquad (7.2)$$
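A minimal sketch of this computation, assuming NumPy (the function and variable names are illustrative, not taken from the chapter):

```python
import numpy as np

def mahalanobis_distance(x, minority_set):
    """Mahalanobis distance from an instance x to the current minority
    class set (rows of minority_set), following Equation 7.2."""
    mu = minority_set.mean(axis=0)              # mean vector of the set
    cov = np.cov(minority_set, rowvar=False)    # covariance matrix Sigma
    cov_inv = np.linalg.pinv(cov)               # pseudo-inverse guards against a singular Sigma
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))
```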
This, however, may exhibit a potential flaw: it assumes that there are no dis-
joint subconcepts within the minority class concept. Otherwise, the minority class
may comprise several subconcepts, as in the two-subconcept case of Figure 7.1b
rather than the single-concept case of Figure 7.1a. This could be potentially improved by adopting the
k-nearest neighbors paradigm to estimate the degree of similarity [29]. Specifically,
for each previous minority class example, the number of current minority class
examples among its k nearest neighbors in the current training data chunk is taken
as its degree of similarity to the current minority class set. This is illustrated
in Figure 7.1c. Here, the highlighted areas surrounded by dashed circles represent
the k-nearest neighbor search areas of the previous minority class examples S1,
S2, S3, S4, and S5. The search area of Si is the region in which the k nearest
neighbors of Si in the current training data chunk fall, which consists of both
majority class examples and minority class examples. Since the majority
class examples do not affect the similarity estimation, they are not shown in
Figure 7.1. Current minority class examples are represented by bold circles, and
the numbers of these falling within the respective "search areas" are 3, 1, 2, 1, and
0. Therefore, the similarities of S1, S2, S3, S4, and S5 to the current
minority example set are ranked as S1 > S3 > S2 = S4 > S5.
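A sketch of this counting scheme, assuming scikit-learn's NearestNeighbors is available (function and parameter names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_similarity(prev_minority, chunk_X, chunk_y, k=5, minority_label=1):
    """For each previous minority class example, count how many of its k
    nearest neighbors in the current chunk are current minority class examples."""
    nn = NearestNeighbors(n_neighbors=k).fit(chunk_X)
    _, idx = nn.kneighbors(prev_minority)       # indices of the k-NN in the current chunk
    return (chunk_y[idx] == minority_label).sum(axis=1)
```

Previous minority class examples can then be ranked by these counts, reproducing an ordering such as S1 > S3 > S2 = S4 > S5 in the example above.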
Using previous minority class examples to compensate for the imbalanced class
ratio could potentially violate the one-pass constraint [3], which mandates that
previous data can never be accessed by the learning process on the current training
data chunk. The reason for imposing the one-pass constraint on incremental learning
is to avoid overflowing the limited memory by retaining vast amounts of
streaming data. However, given the unique nature of imbalanced learning,
in which minority class examples are quite scarce within each training data chunk,
the memory needed to keep them around is affordable.
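One way to act on this memory argument is to retain only the minority class examples from past chunks. A rough sketch follows (NumPy assumed; the class and label names are illustrative, and the chapter's method would additionally filter the retained examples by similarity before reuse):

```python
import numpy as np

class MinorityBuffer:
    """Keeps only the minority class examples seen in past chunks; because
    they are scarce, the buffer stays small relative to the full stream."""

    def __init__(self, minority_label=1):
        self.minority_label = minority_label
        self.examples = []                      # accumulated minority class rows

    def update(self, chunk_X, chunk_y):
        mask = np.asarray(chunk_y) == self.minority_label
        self.examples.extend(np.asarray(chunk_X)[mask])
```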
7.2.2 How to Manage Concept Drifts
Concept drifts could be handled by relying solely on the current training data
chunk. This makes sense because the current training data chunk stands for accurate