$$\sum_{C_i \cap t \neq \emptyset} N(t)\,P_i \;-\; \sum_{j=1,2}\ \sum_{C_i \cap t_j \neq \emptyset} N(t_j)\,P_i \qquad (9)$$
The decision tree construction algorithms do not consider the cost of visiting leaf tuples. For example, consider a partition that generates two nodes containing tuples with labels $(C_1, C_2, C_1)$ and $(C_2)$, and a partition that generates two nodes containing tuples with labels $(C_2, C_1, C_2)$ and $(C_1)$. According to the discussion in Section 5.4.1, these two partitions have the same information gain. However, if $P_1 = 0.5$ and $P_2 = 0$, the category cost of the first partition is smaller: the cost is 1.5 for the first partition and 2 for the second, because a user interested in $C_1$ examines only the three tuples of the first node under the first partition ($0.5 \times 3 = 1.5$), but examines all four tuples across both nodes under the second partition ($0.5 \times (3 + 1) = 2$).
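The numbers above can be reproduced with a short sketch. The Python below is our own illustration, assuming the reading of Equation (9) in which a user interested in class $C_i$ pays $P_i \cdot N(t_j)$ for every node $t_j$ that contains at least one $C_i$ tuple; the function name leaf_cost and the data layout are hypothetical.

```python
def leaf_cost(nodes, probs):
    """Leaf-tuple category cost of a partition (assumed model, see lead-in).

    nodes: list of label sequences, one per node, e.g. [["C1","C2","C1"], ["C2"]]
    probs: mapping from class label to its search probability P_i
    """
    cost = 0.0
    for node in nodes:
        present = set(node)  # classes C_i with C_i ∩ t_j ≠ ∅
        cost += sum(probs[c] for c in present) * len(node)
    return cost

probs = {"C1": 0.5, "C2": 0.0}
print(leaf_cost([["C1", "C2", "C1"], ["C2"]], probs))  # 1.5
print(leaf_cost([["C2", "C1", "C2"], ["C1"]], probs))  # 2.0
```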
Cost of Visiting Intermediate Nodes
To estimate the cost of visiting intermediate nodes, we adopt the method proposed by Chen and Li (Chen & Li, 2007). According to the definition in (Chen & Li, 2007), a perfect tree is a tree whose leaves contain tuples of only one class and cannot be partitioned further. In fact, a perfect tree is a decision tree. Given a perfect tree $T$ with $N$ tuples and $k$ classes, where each class $C_i$ in $T$ has $N_i$ tuples, the entropy

$$E(T) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N}$$

approximates the average length of root-to-leaf paths for all tuples in $T$.
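As a quick illustration (ours, not the chapter's code), the entropy can be computed directly from the per-class counts $N_i$; we assume base-2 logarithms, which the balanced-binary-tree argument below also suggests.

```python
import math

def entropy(class_counts):
    """Entropy E(T) from per-class tuple counts [N_1, ..., N_k] (log base 2 assumed)."""
    n = sum(class_counts)
    return -sum((ni / n) * math.log2(ni / n) for ni in class_counts if ni > 0)

print(entropy([4, 4]))  # 1.0: two equally likely classes
print(entropy([3, 1]))  # ~0.811: skewed classes imply shorter average paths
```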
Since $T$ is a perfect tree, each of its leaves contains tuples of only one class. Each such leaf $L_i$, which contains $N_i$ tuples of class $C_i$, can be further expanded into a smaller subtree $T_i$ rooted at $L_i$, whose leaves each contain exactly one record of $C_i$. Each such small subtree $T_i$ contains $N_i$ leaves. All these subtrees together with $T$ compose a big tree $T_b$ that contains $\sum_{1 \le i \le k} N_i = N$ leaves. We further assume that each $T_i$ and the big tree $T_b$ are balanced; thus the height of $T_i$ is $\log N_i$ and the height of $T_b$ is $\log N$. Note that for the $i$-th leaf $L_i$ in $T$, the length of the path from the root to $L_i$ equals the height of the big tree $T_b$ minus the height of the small tree $T_i$. There are $N_i$ tuples in $L_i$, all with the same path from the root. Thus the average length of root-to-leaf paths over all tuples is as follows:
$$\frac{\sum_{1 \le i \le k} N_i \,(\log N - \log N_i)}{N} = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N} \qquad (10)$$
This is exactly the entropy $E(T)$. Note that most existing decision tree algorithms choose the partition that maximizes information gain. Information gain is the reduction of entropy due to a partition and is represented by the following formula:
$$IGain(t, t_1, t_2) = E(t) - \frac{N_1}{N} E(t_1) - \frac{N_2}{N} E(t_2) \qquad (11)$$
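To see why the two example partitions from the category-cost discussion tie on information gain, the following self-contained sketch (again our own illustration, not the chapter's code) evaluates Equation (11) on a node with labels $(C_1, C_2, C_1, C_2)$ under both partitions.

```python
import math
from collections import Counter

def entropy(node):
    """Entropy of a node, given as a list of class labels (log base 2 assumed)."""
    n = len(node)
    return -sum((c / n) * math.log2(c / n) for c in Counter(node).values())

def igain(t, t1, t2):
    """Information gain of partitioning node t into t1 and t2, per Equation (11)."""
    n = len(t)
    return entropy(t) - (len(t1) / n) * entropy(t1) - (len(t2) / n) * entropy(t2)

t = ["C1", "C2", "C1", "C2"]
print(igain(t, ["C1", "C2", "C1"], ["C2"]))  # ~0.311
print(igain(t, ["C2", "C1", "C2"], ["C1"]))  # ~0.311: identical gain, so the
                                             # category cost breaks the tie
```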
Thus a partition with high information gain will generate a tree with low entropy, and this tree will have short root-to-leaf paths as well. Since, by Definition 2, the cost of visiting intermediate nodes equals the product of path length and fan-out, if we assume the average fan-out is about the same for all trees, then the cost of visiting intermediate nodes is proportional to the length of root-to-leaf paths. Therefore, information gain can be used to estimate the cost reduction of visiting intermediate nodes.