Cost of Visiting Tuples in Intermediate Nodes
Since the user may choose to examine the tuples in an intermediate node, we need to consider the cost of visiting these tuples. Let P_st be the probability that the user chooses the 'ShowTuples' option for an intermediate node t given that she explores t, and let N(t) be the number of tuples in the intermediate node t. The cost of visiting tuples in an intermediate node t then equals P_st · N(t). For example, if P_st = 0.2 and an intermediate node contains 500 tuples, the expected cost is 0.2 · 500 = 100.
Combining Costs
The remaining problem is how to combine the three types of costs. Here we take a normalization ap-
proach, which uses the following formula to estimate the gain of partitioning t into t 1 and t 2 ,
Gain(t, t_1, t_2) = [IGain(t, t_1, t_2) / E(t)] / [(Σ_{j=1,2} Σ_{C_i ∩ t_j ≠ ∅} f(N(t_j) · P_i) / Σ_{C_l ∩ t ≠ ∅} f(N(t) · P_l)) · P_st · N(t)]    (12)
The denominator is the cost of visiting leaf tuples after the partition, normalized by the cost before the partition and multiplied by the cost of visiting the tuples in t. A partition always reduces the cost of visiting tuples (the proof is straightforward), so the normalized cost ranges over (0, 1]. The numerator is the information gain normalized by the entropy of t. We compute the ratio of these two terms rather than the sum of the numerator and (1 - denominator) because in practice the numerator (the information gain) is often quite small; the ratio is therefore more sensitive to the numerator when the denominators of candidate partitions are similar.
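The combined measure can be sketched in code. The following is a hypothetical illustration only: the cost function f, the class-probability inputs, and all names (partition_gain, p_show_tuples, and so on) are assumptions, since the text does not fix a concrete f or an API.

```python
import math

def entropy(class_probs):
    """Shannon entropy E(t) over the class distribution of a node."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

def info_gain(parent_probs, child_probs, child_weights):
    """IGain(t, t1, t2): entropy reduction from splitting t into t1 and t2."""
    e_children = sum(w * entropy(p) for w, p in zip(child_weights, child_probs))
    return entropy(parent_probs) - e_children

def partition_gain(parent_probs, child_probs, child_weights,
                   n_parent, p_show_tuples, f=lambda x: x):
    """Normalized gain in the spirit of Formula (12): information gain
    normalized by the parent entropy, divided by the leaf-visiting cost
    after the partition (normalized by the cost before it) times the
    intermediate-node visiting cost P_st * N(t)."""
    numerator = (info_gain(parent_probs, child_probs, child_weights)
                 / entropy(parent_probs))
    cost_before = sum(f(n_parent * p) for p in parent_probs if p > 0)
    cost_after = sum(
        sum(f(w * n_parent * p) for p in probs if p > 0)
        for w, probs in zip(child_weights, child_probs)
    )
    denominator = (cost_after / cost_before) * p_show_tuples * n_parent
    return numerator / denominator
```

For instance, a perfectly class-separating split of a two-class node (parent distribution [0.5, 0.5], children [1, 0] and [0, 1] of equal size) has a normalized information gain of 1, so the overall score is governed entirely by the visiting-cost denominator.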
Complexity Analysis
Let n be the number of tuples in query results, m be the number of attributes, and k be the number of
classes. The gain in Formula 5 can be computed in O( k ) time. C4.5 also uses several optimizations such
as computing the gains for all partition points of an attribute in one pass, sorting all tuples on different
attribute values beforehand, and reusing the sort order. The cost of sorting tuples on different attribute
values is O( mn log n ), and the cost of computing gains for all possible partitions at one node is O( mnk )
because there are at most m partition attributes and n possible partition points, and each gain can be
computed in O( k ) time. If we assume the generated tree has O(log n ) levels, the total time is O( mnk log n ).
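The C4.5-style one-pass optimization mentioned above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: after one O(n log n) sort of a numeric attribute, left/right class-count vectors are updated incrementally so that each candidate partition point is scored in O(k) time, giving O(nk) per attribute. All function and variable names here are hypothetical.

```python
import math

def entropy_from_counts(counts, total):
    """Entropy of a class-count vector; O(k) for k classes."""
    e = 0.0
    for c in counts:
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

def best_numeric_split(values, labels, num_classes):
    """One pass over the sorted tuples of a numeric attribute,
    maintaining left/right class counts incrementally so that each
    candidate partition point costs O(k), as in C4.5."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    left = [0] * num_classes
    right = [0] * num_classes
    for lbl in labels:
        right[lbl] += 1
    parent_entropy = entropy_from_counts(right, n)
    best_gain, best_threshold = 0.0, None
    for pos, i in enumerate(order[:-1], start=1):
        left[labels[i]] += 1            # move one tuple across the boundary
        right[labels[i]] -= 1
        # only a change in attribute value yields a distinct partition point
        if values[i] == values[order[pos]]:
            continue
        e_split = (pos / n) * entropy_from_counts(left, pos) \
                + ((n - pos) / n) * entropy_from_counts(right, n - pos)
        gain = parent_entropy - e_split
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[order[pos]]) / 2
    return best_threshold, best_gain
```

For example, on values [1, 2, 3, 4] with labels [0, 0, 1, 1], the sweep finds the threshold 2.5 with gain 1.0, having examined each of the n - 1 candidate boundaries in O(k) time.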
EXPERIMENTAL EVALUATION
In this section, we describe our experiments, report the experimental results and compare our approach
with several existing approaches.
Experimental Setup
We used Microsoft SQL Server 2005 RDBMS on a P4 3.2-GHz PC with 1 GB of RAM for our experi-
ments.
Dataset: For our evaluation, we set up a real estate database HouseDB (Price, SqFt, Bedrooms, Bathrooms, Livingarea, Schooldistrict, View, Neighborhood, Boat, Garage, Buildyear…) containing 1,700,000 tuples extracted from the MSN House&Home Web site. There are 27 attributes: 10 numerical and 17 categorical. The total data size is 20 MB.