of the features is assumed independent. With this assumption, we can compute
p(x_i | y) = ∏_j p(x_ij = x_ijk | y), where x_ij denotes feature j of instance
i, and x_ijk denotes the kth possible value of feature j (here, the value
observed for instance i). Therefore, naïve Bayes is simply skew insensitive,
as predictions are calibrated by p(y), the prior probability of class y.
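The class-conditional independence assumption above can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: the function names (train_nb, predict_nb) are hypothetical, and the smoothing constant is an assumed choice to keep unseen values from zeroing the product.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y_labels):
    """Estimate class priors p(y) and per-feature value counts for p(x_ij = v | y).
    X: list of nominal feature tuples; y_labels: matching class labels."""
    priors = Counter(y_labels)
    counts = defaultdict(Counter)  # (class, feature index j) -> value counts
    for xs, y in zip(X, y_labels):
        for j, v in enumerate(xs):
            counts[(y, j)][v] += 1
    return priors, counts

def predict_nb(xs, priors, counts):
    """Return the class maximizing log p(y) + sum_j log p(x_ij | y)."""
    n = sum(priors.values())
    best, best_score = None, -math.inf
    for y, ny in priors.items():
        score = math.log(ny / n)  # the prior p(y) calibrates the prediction
        for j, v in enumerate(xs):
            c = counts[(y, j)]
            # add-one smoothing (an assumed choice, not from the text)
            score += math.log((c[v] + 1) / (ny + len(c) + 1))
        if score > best_score:
            best, best_score = y, score
    return best
```

Working in log space avoids numerical underflow when many features are multiplied together.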
Another classifier that has recently been made skew insensitive is the decision
tree. Hellinger distance decision trees (HDDTs) [29] are strongly skew insensitive,
using an adaptation of the Hellinger distance as the decision tree splitting
criterion; this mitigates the need for sampling.
For the sake of clarity, we present the basic decision tree-building algorithm.
The algorithm (Algorithm
Build Tree
) differs from the traditional C4.5 [30]
algorithm in two important facets, both motivated by the research of Provost
and Domingos [31]. First, when building the decision tree,
Build Tree
does not
consider pruning or collapsing. Second, when classifying an instance, Laplace
smoothing is applied. These choices are motivated by empirical results
demonstrating that a full tree with Laplace smoothing outperforms all other
configurations [31], results that are particularly relevant for imbalanced
datasets. When C4.5 decision trees are built in this way (i.e., without pruning,
without collapsing, and with Laplace smoothing), they are called
C4.4 decision trees [31].
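The Laplace smoothing applied at classification time can be illustrated with a minimal sketch (the function name is hypothetical): each leaf's class probability is estimated as (n_c + 1) / (n + K), where n is the number of training instances at the leaf and K the number of classes, so no class ever receives probability zero.

```python
def laplace_leaf_estimate(class_counts):
    """Laplace-smoothed class probabilities at a leaf.
    class_counts: dict mapping class label -> instance count at the leaf."""
    n = sum(class_counts.values())  # leaf size
    k = len(class_counts)           # number of classes
    return {c: (nc + 1) / (n + k) for c, nc in class_counts.items()}
```

For imbalanced data this matters: a small leaf with counts {pos: 1, neg: 0} yields probabilities (2/3, 1/3) rather than an overconfident (1, 0).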
Algorithm Build Tree
Require: Training set T, cut-off size C
1: if |T| < C then
2:   return
3: end if
4: for each feature f of T do
5:   H_f ← Calc_Criterion_Value(T, f)
6: end for
7: b ← max(H)
8: for each value v of b do
9:   Build_Tree(T_{x_b = v}, C)
10: end for
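The recursion above can be sketched directly in Python. This is a minimal illustration, not the authors' implementation: the data layout (lists of (features, label) pairs) and the extra stopping guard are assumptions, and the splitting criterion is passed in as a function so that either gain ratio or Hellinger distance could be plugged in.

```python
from collections import Counter

def build_tree(T, C, calc_criterion_value):
    """Recursive sketch of Build_Tree for nominal features.
    T: list of (features_dict, label); C: cut-off size;
    calc_criterion_value(T, f): splitting-criterion score for feature f."""
    if len(T) < C:
        return {"leaf": True, "counts": Counter(y for _, y in T)}
    features = T[0][0].keys()
    scores = {f: calc_criterion_value(T, f) for f in features}
    b = max(scores, key=scores.get)          # b <- feature with highest H_f
    values = {x[b] for x, _ in T}
    if len(values) == 1:
        # guard (not in the pseudocode): the best feature no longer splits T
        return {"leaf": True, "counts": Counter(y for _, y in T)}
    children = {}
    for v in values:                          # one branch per value of b
        subset = [(x, y) for x, y in T if x[b] == v]
        children[v] = build_tree(subset, C, calc_criterion_value)
    return {"leaf": False, "feature": b, "children": children}
```

Note that, per the text, no pruning or collapsing step follows: the tree is grown to the cut-off and left full.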
An important thing to note is that
Build Tree
is only defined for nominal
features. For continuous features, a slight variation to
Build Tree
is used, where
Calc Criterion Value sorts the instances by the feature value, finds all mean-
ingful splits, calculates the binary criterion value at each split, and returns the
highest value; this procedure is identical to the one used in C4.5.
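The continuous-feature variant can be sketched as follows. This is an assumed, simplified rendering (the function name is hypothetical, and "meaningful split" is approximated here as any threshold between two distinct feature values); the binary criterion is again passed in as a function.

```python
def calc_criterion_value_continuous(T, f, binary_criterion):
    """Continuous-feature variant: sort by feature f, evaluate the binary
    criterion at every candidate threshold, return the best score.
    T: list of (features_dict, label); binary_criterion(left, right) -> float."""
    ordered = sorted(T, key=lambda t: t[0][f])
    best = float("-inf")
    for i in range(1, len(ordered)):
        a, b = ordered[i - 1][0][f], ordered[i][0][f]
        if a == b:
            continue  # no threshold fits between equal values
        left, right = ordered[:i], ordered[i:]
        best = max(best, binary_criterion(left, right))
    return best
```

In practice C4.5 also skips thresholds where the class label does not change between neighbouring instances; that refinement is omitted here for brevity.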
The important function to consider when building a decision tree is
Calc Criterion Value
. In C4.5, this function is gain ratio, which is a measure
of purity based on entropy [30], while in HDDT, this function is Hellinger
distance. We now describe the Hellinger distance as a splitting criterion.
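As a preview of that criterion, a common two-class formulation of the Hellinger distance over a nominal feature's values can be sketched as below. This is a hedged illustration based on the usual HDDT formulation, not a quotation of [29]; the function name and data layout are assumptions.

```python
import math

def hellinger_distance(T, f):
    """Hellinger distance between the two class-conditional distributions
    of feature f's values, used as a splitting criterion.
    T: list of (features_dict, label) with labels in {0, 1}."""
    pos = [x[f] for x, y in T if y == 1]
    neg = [x[f] for x, y in T if y == 0]
    total = 0.0
    for v in set(pos) | set(neg):
        p = pos.count(v) / len(pos)   # P(x_f = v | y = 1)
        q = neg.count(v) / len(neg)   # P(x_f = v | y = 0)
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total)
```

Because the per-value counts are normalized within each class before being compared, the measure depends only on the class-conditional distributions, not on the class priors, which is the source of its skew insensitivity.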