of the features is assumed independent. With this assumption, we can compute
p(x_i | y) = ∏_j p(x_ij = x_ijk | y), where x_ij denotes feature j of instance
i, and x_ijk denotes the kth possible value of feature j (here, the value
observed for instance i). Therefore, naïve Bayes is simply skew insensitive,
as predictions are calibrated by p(y), the prior probability of class y.
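The class-conditional independence assumption above can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: the function names (train_nb, predict_nb) are hypothetical, and the smoothing constant is an assumed choice to keep unseen values from zeroing the product.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y_labels):
    """Estimate class priors p(y) and per-feature value counts for p(x_ij = v | y).
    X: list of nominal feature tuples; y_labels: matching class labels."""
    priors = Counter(y_labels)
    counts = defaultdict(Counter)  # (class, feature index j) -> value counts
    for xs, y in zip(X, y_labels):
        for j, v in enumerate(xs):
            counts[(y, j)][v] += 1
    return priors, counts

def predict_nb(xs, priors, counts):
    """Return the class maximizing log p(y) + sum_j log p(x_ij | y)."""
    n = sum(priors.values())
    best, best_score = None, -math.inf
    for y, ny in priors.items():
        score = math.log(ny / n)  # the prior p(y) calibrates the prediction
        for j, v in enumerate(xs):
            c = counts[(y, j)]
            # add-one smoothing (an assumed choice, not from the text)
            score += math.log((c[v] + 1) / (ny + len(c) + 1))
        if score > best_score:
            best, best_score = y, score
    return best
```

Working in log space avoids numerical underflow when many features are multiplied together.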
Another classifier that has recently been made skew insensitive is the decision
tree. Hellinger distance decision trees (HDDTs) [29] are strongly skew insensitive,
using an adaptation of the Hellinger distance as the decision tree splitting
criterion; this mitigates the need for sampling.
For the sake of clarity, we present the basic decision tree-building algorithm.
The algorithm (Algorithm
Build Tree
) differs from the traditional C4.5 [30]
algorithm in two important facets, both motivated by the research of Provost
and Domingos [31]. First, when building the decision tree,
Build Tree
does not
consider pruning or collapsing. Second, when classifying an instance, Laplace
smoothing is applied. These choices are motivated by empirical results
demonstrating that a full tree with Laplace smoothing outperforms all other
configurations [31], results that are particularly relevant for imbalanced
datasets. When C4.5 decision trees are built in this way (i.e., without pruning,
without collapsing, and with Laplace smoothing), they are called
C4.4 decision trees [31].
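The Laplace smoothing applied at classification time can be illustrated with a minimal sketch (the function name is hypothetical): each leaf's class probability is estimated as (n_c + 1) / (n + K), where n is the number of training instances at the leaf and K the number of classes, so no class ever receives probability zero.

```python
def laplace_leaf_estimate(class_counts):
    """Laplace-smoothed class probabilities at a leaf.
    class_counts: dict mapping class label -> instance count at the leaf."""
    n = sum(class_counts.values())  # leaf size
    k = len(class_counts)           # number of classes
    return {c: (nc + 1) / (n + k) for c, nc in class_counts.items()}
```

For imbalanced data this matters: a small leaf with counts {pos: 1, neg: 0} yields probabilities (2/3, 1/3) rather than an overconfident (1, 0).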
Algorithm Build Tree
Require: Training set T, cut-off size C
1: if |T| < C then
2:   return
3: end if
4: for each feature f of T do
5:   H_f ← Calc_Criterion_Value(T, f)
6: end for
7: b ← max(H)
8: for each value v of b do
9:   Build_Tree(T_{x_b = v}, C)
10: end for
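The recursion above can be sketched directly in Python. This is a minimal illustration, not the authors' implementation: the data layout (lists of (features, label) pairs) and the extra stopping guard are assumptions, and the splitting criterion is passed in as a function so that either gain ratio or Hellinger distance could be plugged in.

```python
from collections import Counter

def build_tree(T, C, calc_criterion_value):
    """Recursive sketch of Build_Tree for nominal features.
    T: list of (features_dict, label); C: cut-off size;
    calc_criterion_value(T, f): splitting-criterion score for feature f."""
    if len(T) < C:
        return {"leaf": True, "counts": Counter(y for _, y in T)}
    features = T[0][0].keys()
    scores = {f: calc_criterion_value(T, f) for f in features}
    b = max(scores, key=scores.get)          # b <- feature with highest H_f
    values = {x[b] for x, _ in T}
    if len(values) == 1:
        # guard (not in the pseudocode): the best feature no longer splits T
        return {"leaf": True, "counts": Counter(y for _, y in T)}
    children = {}
    for v in values:                          # one branch per value of b
        subset = [(x, y) for x, y in T if x[b] == v]
        children[v] = build_tree(subset, C, calc_criterion_value)
    return {"leaf": False, "feature": b, "children": children}
```

Note that, per the text, no pruning or collapsing step follows: the tree is grown to the cut-off and left full.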
An important thing to note is that
Build Tree
is only defined for nominal
features. For continuous features, a slight variation to
Build Tree
is used, where
Calc Criterion Value sorts the instances by the feature value, finds all mean-
ingful splits, calculates the binary criterion value at each split, and returns the
highest value; this procedure is identical to the one used in C4.5.
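The continuous-feature variant can be sketched as follows. This is an assumed, simplified rendering (the function name is hypothetical, and "meaningful split" is approximated here as any threshold between two distinct feature values); the binary criterion is again passed in as a function.

```python
def calc_criterion_value_continuous(T, f, binary_criterion):
    """Continuous-feature variant: sort by feature f, evaluate the binary
    criterion at every candidate threshold, return the best score.
    T: list of (features_dict, label); binary_criterion(left, right) -> float."""
    ordered = sorted(T, key=lambda t: t[0][f])
    best = float("-inf")
    for i in range(1, len(ordered)):
        a, b = ordered[i - 1][0][f], ordered[i][0][f]
        if a == b:
            continue  # no threshold fits between equal values
        left, right = ordered[:i], ordered[i:]
        best = max(best, binary_criterion(left, right))
    return best
```

In practice C4.5 also skips thresholds where the class label does not change between neighbouring instances; that refinement is omitted here for brevity.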
The important function to consider when building a decision tree is
Calc Criterion Value
. In C4.5, this function is gain ratio, which is a measure
of purity based on entropy [30], while in HDDT, this function is Hellinger
distance. We now describe the Hellinger distance as a splitting criterion.
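As a preview of that criterion, a common two-class formulation of the Hellinger distance over a nominal feature's values can be sketched as below. This is a hedged illustration based on the usual HDDT formulation, not a quotation of [29]; the function name and data layout are assumptions.

```python
import math

def hellinger_distance(T, f):
    """Hellinger distance between the two class-conditional distributions
    of feature f's values, used as a splitting criterion.
    T: list of (features_dict, label) with labels in {0, 1}."""
    pos = [x[f] for x, y in T if y == 1]
    neg = [x[f] for x, y in T if y == 0]
    total = 0.0
    for v in set(pos) | set(neg):
        p = pos.count(v) / len(pos)   # P(x_f = v | y = 1)
        q = neg.count(v) / len(neg)   # P(x_f = v | y = 0)
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total)
```

Because the per-value counts are normalized within each class before being compared, the measure depends only on the class-conditional distributions, not on the class priors, which is the source of its skew insensitivity.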