Cost of Visiting Tuples in Intermediate Nodes
Since the user may choose to examine the tuples in an intermediate node, we need to consider the cost of visiting these tuples. Let P_st be the probability that the user chooses the 'ShowTuples' option for an intermediate node t given that she explores t, and let N(t) be the number of tuples in the intermediate node t. The cost of visiting tuples in an intermediate node t then equals P_st · N(t). For example, if P_st = 0.2 and an intermediate node contains 500 tuples, the expected cost is 0.2 · 500 = 100.
Combining Costs
The remaining problem is how to combine the three types of costs. Here we take a normalization ap-
proach, which uses the following formula to estimate the gain of partitioning t into t 1 and t 2 ,
Gain(t, t_1, t_2) = [IGain(t, t_1, t_2) / E(t)] / [(Σ_{j=1,2} Σ_{C_i ∩ t_j ≠ ∅} f(N(t_j) · P_i) / Σ_{C_l ∩ t ≠ ∅} f(N(t) · P_l)) · P_st · N(t)]    (12)
The denominator is the cost of visiting leaf tuples after the partition, normalized by the cost before the partition and multiplied by the cost of visiting the tuples in t. A partition always reduces the cost of visiting tuples (the proof is straightforward), so the normalized cost ranges over (0, 1]. The numerator is the information gain normalized by the entropy of t. We compute the ratio of these two terms rather than the sum of the numerator and (1 - denominator) because in practice the numerator (the information gain) is often quite small; the ratio is therefore more sensitive to the numerator when the denominators of candidate partitions are similar.
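The combined measure can be sketched in code. The following is a hypothetical illustration only: the cost function f, the class-probability inputs, and all names (partition_gain, p_show_tuples, and so on) are assumptions, since the text does not fix a concrete f or an API.

```python
import math

def entropy(class_probs):
    """Shannon entropy E(t) over the class distribution of a node."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

def info_gain(parent_probs, child_probs, child_weights):
    """IGain(t, t1, t2): entropy reduction from splitting t into t1 and t2."""
    e_children = sum(w * entropy(p) for w, p in zip(child_weights, child_probs))
    return entropy(parent_probs) - e_children

def partition_gain(parent_probs, child_probs, child_weights,
                   n_parent, p_show_tuples, f=lambda x: x):
    """Normalized gain in the spirit of Formula (12): information gain
    normalized by the parent entropy, divided by the leaf-visiting cost
    after the partition (normalized by the cost before it) times the
    intermediate-node visiting cost P_st * N(t)."""
    numerator = (info_gain(parent_probs, child_probs, child_weights)
                 / entropy(parent_probs))
    cost_before = sum(f(n_parent * p) for p in parent_probs if p > 0)
    cost_after = sum(
        sum(f(w * n_parent * p) for p in probs if p > 0)
        for w, probs in zip(child_weights, child_probs)
    )
    denominator = (cost_after / cost_before) * p_show_tuples * n_parent
    return numerator / denominator
```

For instance, a perfectly class-separating split of a two-class node (parent distribution [0.5, 0.5], children [1, 0] and [0, 1] of equal size) has a normalized information gain of 1, so the overall score is governed entirely by the visiting-cost denominator.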
Complexity Analysis
Let n be the number of tuples in query results, m be the number of attributes, and k be the number of
classes. The gain in Formula 5 can be computed in O( k ) time. C4.5 also uses several optimizations such
as computing the gains for all partition points of an attribute in one pass, sorting all tuples on different
attribute values beforehand, and reusing the sort order. The cost of sorting tuples on different attribute
values is O( mn log n ), and the cost of computing gains for all possible partitions at one node is O( mnk )
because there are at most m partition attributes and n possible partition points, and each gain can be
computed in O( k ) time. If we assume the generated tree has O(log n ) levels, the total time is O( mnk log n ).
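The C4.5-style one-pass optimization mentioned above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: after one O(n log n) sort of a numeric attribute, left/right class-count vectors are updated incrementally so that each candidate partition point is scored in O(k) time, giving O(nk) per attribute. All function and variable names here are hypothetical.

```python
import math

def entropy_from_counts(counts, total):
    """Entropy of a class-count vector; O(k) for k classes."""
    e = 0.0
    for c in counts:
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

def best_numeric_split(values, labels, num_classes):
    """One pass over the sorted tuples of a numeric attribute,
    maintaining left/right class counts incrementally so that each
    candidate partition point costs O(k), as in C4.5."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    left = [0] * num_classes
    right = [0] * num_classes
    for lbl in labels:
        right[lbl] += 1
    parent_entropy = entropy_from_counts(right, n)
    best_gain, best_threshold = 0.0, None
    for pos, i in enumerate(order[:-1], start=1):
        left[labels[i]] += 1            # move one tuple across the boundary
        right[labels[i]] -= 1
        # only a change in attribute value yields a distinct partition point
        if values[i] == values[order[pos]]:
            continue
        e_split = (pos / n) * entropy_from_counts(left, pos) \
                + ((n - pos) / n) * entropy_from_counts(right, n - pos)
        gain = parent_entropy - e_split
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[order[pos]]) / 2
    return best_threshold, best_gain
```

For example, on values [1, 2, 3, 4] with labels [0, 0, 1, 1], the sweep finds the threshold 2.5 with gain 1.0, having examined each of the n - 1 candidate boundaries in O(k) time.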
EXPERIMENTAL EVALUATION
In this section, we describe our experiments, report the experimental results and compare our approach
with several existing approaches.
Experimental Setup
We used Microsoft SQL Server 2005 RDBMS on a P4 3.2-GHz PC with 1 GB of RAM for our experi-
ments.
Dataset: For our evaluation, we set up a real estate database HouseDB (Price, SqFt, Bedrooms, Bathrooms, Livingarea, Schooldistrict, View, Neighborhood, Boat, Garage, Buildyear…) containing 1,700,000 tuples extracted from the MSN House&Home Web site. There are 27 attributes: 10 numerical and 17 categorical. The total data size is 20 MB.