The cost of visiting leaf tuples (the category cost) for a partition of a node into $t_1$ and $t_2$ is

$$\mathrm{Cost}_L(t_1, t_2) = \sum_{j=1,2} N(t_j) \sum_{l:\, C_l \cap t_j \neq \emptyset} P_l \qquad (9)$$

where $N(t_j)$ is the number of tuples in node $t_j$, $C_l \cap t_j \neq \emptyset$ means that class $C_l$ appears in $t_j$, and $P_l$ is the probability that the user is interested in class $C_l$.
The decision tree construction algorithms do not consider the cost of visiting leaf tuples. For example, consider a partition that generates two nodes containing tuples with labels $(C_1, C_2, C_1)$ and $(C_2)$, and a partition that generates two nodes containing tuples with labels $(C_2, C_1, C_2)$ and $(C_1)$. According to the discussion in Section 5.4.1, these two partitions have the same information gain. However, if $P_1 = 0.5$ and $P_2 = 0$, then the category cost of the first partition is smaller: by equation (9) the first partition costs $3(P_1 + P_2) + 1 \cdot P_2 = 1.5$, while the second costs $3(P_1 + P_2) + 1 \cdot P_1 = 2$.
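To make this computation concrete, here is a minimal sketch that evaluates equation (9) on the example above. It is illustrative only: the function name leaf_cost, the representation of a node as a list of class labels, and the probability dictionary are our own choices, not notation from the original.

```python
def leaf_cost(nodes, probs):
    """Cost of visiting leaf tuples for a partition, per equation (9).

    nodes: list of nodes, each given as a list of class labels (one per tuple).
    probs: dict mapping class label -> probability the user is interested
           in that class.
    """
    total = 0.0
    for node in nodes:
        n = len(node)        # N(t_j): number of tuples in the node
        present = set(node)  # classes C_l with C_l intersect t_j nonempty
        total += n * sum(probs[c] for c in present)
    return total

probs = {"C1": 0.5, "C2": 0.0}
# First partition: nodes (C1, C2, C1) and (C2)
print(leaf_cost([["C1", "C2", "C1"], ["C2"]], probs))  # 1.5
# Second partition: nodes (C2, C1, C2) and (C1)
print(leaf_cost([["C2", "C1", "C2"], ["C1"]], probs))  # 2.0
```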
Cost of Visiting Intermediate Nodes
To estimate the cost of visiting intermediate nodes, we adopt the method proposed by Chen and Li (Chen & Li, 2007). According to the definition in (Chen & Li, 2007), a perfect tree is a tree whose leaves contain tuples of only one class and cannot be partitioned further. In fact, a perfect tree is a decision tree. Given a perfect tree T with N tuples and k classes, where each class $C_i$ in T has $N_i$ tuples, the entropy

$$E(T) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N}$$

approximates the average length of root-to-leaf paths for all tuples in T.
Since T is a perfect tree, each of its leaves contains tuples of only one class. Each such leaf $L_i$, which contains the $N_i$ tuples of class $C_i$, can be further expanded into a smaller subtree $T_i$ rooted at $L_i$, each of whose leaves contains exactly one tuple of $C_i$. Each such small subtree $T_i$ therefore contains $N_i$ leaves. All these subtrees together with T compose a big tree $T_b$ that contains $\sum_{i=1}^{k} N_i = N$ leaves. We further assume that each $T_i$ and the big tree $T_b$ are balanced; thus the height of $T_i$ is $\log N_i$ and the height of $T_b$ is $\log N$. Note that for the i-th leaf $L_i$ in T, the length of the path from the root to $L_i$ equals the height of the big tree $T_b$ minus the height of the small tree $T_i$. There are $N_i$ tuples in $L_i$, all with the same path from the root. Thus the average length of root-to-leaf paths for tuples is
$$\sum_{1 \le i \le k} \frac{N_i}{N} \left( \log N - \log N_i \right) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N} \qquad (10)$$
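The identity in equation (10) can be checked numerically. The sketch below computes both sides for a hypothetical class distribution; we assume base-2 logarithms (consistent with heights of balanced binary trees), and the class counts are made up for illustration.

```python
import math

def entropy(class_counts):
    """E(T) = -sum_i (N_i/N) * log(N_i/N), using log base 2."""
    n = sum(class_counts)
    return -sum((ni / n) * math.log2(ni / n) for ni in class_counts if ni > 0)

def avg_path_length(class_counts):
    """Left-hand side of equation (10): sum_i (N_i/N) * (log N - log N_i),
    the average root-to-leaf path length under the balanced-tree assumption."""
    n = sum(class_counts)
    return sum((ni / n) * (math.log2(n) - math.log2(ni))
               for ni in class_counts if ni > 0)

counts = [4, 2, 2]                # hypothetical N_i per class, N = 8
print(entropy(counts))            # 1.5
print(avg_path_length(counts))    # 1.5, equal as equation (10) states
```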
This is exactly the entropy E(T). Note that most existing decision tree algorithms choose the partition that maximizes information gain. Information gain is the reduction of entropy due to a partition and is given by the following formula:
$$\mathrm{IGain}(t, t_1, t_2) = E(t) - \frac{N_1}{N} E(t_1) - \frac{N_2}{N} E(t_2) \qquad (11)$$
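A short sketch of equation (11) follows, with entropy defined over the class counts of a node's tuples as above (base-2 logarithm assumed); the split in the usage example is hypothetical.

```python
import math

def entropy(class_counts):
    """E(t) over the class counts of a node's tuples."""
    n = sum(class_counts)
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

def igain(parent, left, right):
    """IGain(t, t1, t2) = E(t) - (N1/N) E(t1) - (N2/N) E(t2), equation (11)."""
    n, n1, n2 = sum(parent), sum(left), sum(right)
    return entropy(parent) - (n1 / n) * entropy(left) - (n2 / n) * entropy(right)

# Hypothetical split of a node with 5 C1 and 5 C2 tuples into a pure
# C1 node and a mostly-C2 node.
print(igain([5, 5], [4, 0], [1, 5]))  # about 0.61
```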
Thus a partition with a high information gain will generate a tree with a low entropy, and this tree will have short root-to-leaf paths as well. Since the cost of visiting intermediate nodes equals the product of path length and fan-out by Definition 2, if we assume the average fan-out is about the same for all trees, then the cost of visiting intermediate nodes is proportional to the length of root-to-leaf paths. Therefore, information gain can be used to estimate the cost reduction of visiting intermediate nodes.