$$\sum_{C_i \cap t \neq \emptyset} N(t)\,P_i \;-\; \sum_{j=1,2}\ \sum_{C_i \cap t_j \neq \emptyset} N(t_j)\,P_i \qquad (9)$$
The decision tree construction algorithms do not consider the cost of visiting leaf tuples. For example, consider a partition that generates two nodes containing tuples with labels $(C_1, C_2, C_1)$ and $(C_2)$, and a partition that generates two nodes containing tuples with labels $(C_2, C_1, C_2)$ and $(C_1)$. According to the discussion in Section 5.4.1, these two partitions have the same information gain. However, if $P_1 = 0.5$ and $P_2 = 0$, the category cost of the first partition is smaller: the cost is 1.5 for the first partition and 2 for the second, because a user interested in $C_1$ examines only the three tuples of the first node under the first partition ($0.5 \times 3 = 1.5$), but examines all four tuples across both nodes under the second partition ($0.5 \times (3 + 1) = 2$).
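The numbers above can be reproduced with a short sketch. The Python below is our own illustration, assuming the reading of Equation (9) in which a user interested in class $C_i$ pays $P_i \cdot N(t_j)$ for every node $t_j$ that contains at least one $C_i$ tuple; the function name leaf_cost and the data layout are hypothetical.

```python
def leaf_cost(nodes, probs):
    """Leaf-tuple category cost of a partition (assumed model, see lead-in).

    nodes: list of label sequences, one per node, e.g. [["C1","C2","C1"], ["C2"]]
    probs: mapping from class label to its search probability P_i
    """
    cost = 0.0
    for node in nodes:
        present = set(node)  # classes C_i with C_i ∩ t_j ≠ ∅
        cost += sum(probs[c] for c in present) * len(node)
    return cost

probs = {"C1": 0.5, "C2": 0.0}
print(leaf_cost([["C1", "C2", "C1"], ["C2"]], probs))  # 1.5
print(leaf_cost([["C2", "C1", "C2"], ["C1"]], probs))  # 2.0
```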
Cost of Visiting Intermediate Nodes
To estimate the cost of visiting intermediate nodes, we adopt the method proposed by Chen and Li (Chen & Li, 2007). According to the definition in (Chen & Li, 2007), a perfect tree is a tree whose leaves contain tuples of only one class and cannot be partitioned further. In fact, a perfect tree is a decision tree. Given a perfect tree $T$ with $N$ tuples and $k$ classes, where each class $C_i$ in $T$ has $N_i$ tuples, the entropy

$$E(T) = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N}$$

approximates the average length of root-to-leaf paths for all tuples in $T$.
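As a quick illustration (ours, not the chapter's code), the entropy can be computed directly from the per-class counts $N_i$; we assume base-2 logarithms, which the balanced-binary-tree argument below also suggests.

```python
import math

def entropy(class_counts):
    """Entropy E(T) from per-class tuple counts [N_1, ..., N_k] (log base 2 assumed)."""
    n = sum(class_counts)
    return -sum((ni / n) * math.log2(ni / n) for ni in class_counts if ni > 0)

print(entropy([4, 4]))  # 1.0: two equally likely classes
print(entropy([3, 1]))  # ~0.811: skewed classes imply shorter average paths
```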
Since $T$ is a perfect tree, each of its leaves contains tuples of only one class. Each such leaf $L_i$, which contains $N_i$ tuples of class $C_i$, can be further expanded into a smaller subtree $T_i$ rooted at $L_i$, whose leaves each contain exactly one record of $C_i$. Each such small subtree $T_i$ contains $N_i$ leaves. All these subtrees together with $T$ compose a big tree $T_b$ that contains $\sum_{1 \le i \le k} N_i = N$ leaves. We further assume that each $T_i$ and the big tree $T_b$ are balanced; thus the height of $T_i$ is $\log N_i$ and the height of $T_b$ is $\log N$. Note that for the $i$-th leaf $L_i$ in $T$, the length of the path from the root to $L_i$ equals the height of the big tree $T_b$ minus the height of the small tree $T_i$. There are $N_i$ tuples in $L_i$, all with the same path from the root. Thus the average length of root-to-leaf paths over all tuples is as follows:
$$\frac{\sum_{1 \le i \le k} N_i \,(\log N - \log N_i)}{N} = -\sum_{i=1}^{k} \frac{N_i}{N} \log \frac{N_i}{N} \qquad (10)$$
This is exactly the entropy $E(T)$. Note that most existing decision tree algorithms choose the partition that maximizes information gain. Information gain is the reduction of entropy due to a partition and is represented by the following formula:
$$IGain(t, t_1, t_2) = E(t) - \frac{N_1}{N} E(t_1) - \frac{N_2}{N} E(t_2) \qquad (11)$$
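To see why the two example partitions from the category-cost discussion tie on information gain, the following self-contained sketch (again our own illustration, not the chapter's code) evaluates Equation (11) on a node with labels $(C_1, C_2, C_1, C_2)$ under both partitions.

```python
import math
from collections import Counter

def entropy(node):
    """Entropy of a node, given as a list of class labels (log base 2 assumed)."""
    n = len(node)
    return -sum((c / n) * math.log2(c / n) for c in Counter(node).values())

def igain(t, t1, t2):
    """Information gain of partitioning node t into t1 and t2, per Equation (11)."""
    n = len(t)
    return entropy(t) - (len(t1) / n) * entropy(t1) - (len(t2) / n) * entropy(t2)

t = ["C1", "C2", "C1", "C2"]
print(igain(t, ["C1", "C2", "C1"], ["C2"]))  # ~0.311
print(igain(t, ["C2", "C1", "C2"], ["C1"]))  # ~0.311: identical gain, so the
                                             # category cost breaks the tie
```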
Thus a partition with high information gain will generate a tree with low entropy, and this tree will have short root-to-leaf paths as well. Since, by Definition 2, the cost of visiting intermediate nodes equals the product of path length and fan-out, if we assume the average fan-out is about the same for all trees, then the cost of visiting intermediate nodes is proportional to the length of root-to-leaf paths. Therefore, information gain can be used to estimate the cost reduction of visiting intermediate nodes.