Table 2. Test queries used in user study

  Query                                                         Result size
  Q1: Price between 250000 and 300000 and SqFt > 1000                   323
  Q2: Schooldistrict = Tukwila and View = GreenBelt                     894
  Q3: Price < 250000 and Neighborhood in {Winslow, Shoreline}           452
  Q4: Schooldistrict = Seattle and View in {Water, GreenBelt}           558
  Q5: SqFt between 600 and 1000                                       16213
Query history: In our experiments, we asked 40 subjects to act as different kinds of house buyers
(e.g., wealthy buyers, clerks, workers, women, young couples) and to pose queries against the database.
We collected 2000 queries for the database, and these queries are used as the query history. Each
subject was asked to submit 15 queries for HouseDB; each query had 2-6 conditions and specified 4.2
attributes on average. We assume all queries have equal weight. We observed that users typically
started with a general query that returned many answers and then gradually refined it until it
returned a small number of answers.
Algorithm: We implemented all algorithms in C# and connected to the RDBMS through ADO. Clusters
are stored by adding a column to the data table that holds the class label of each tuple. The stopping
threshold λ in the tree-building algorithm is set to 0.002. We developed an interface that allows users
to navigate query results using the generated trees.
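The storage scheme described above can be sketched as follows. This is a minimal illustration using SQLite rather than the authors' C#/ADO setup, and the schema (`houses`, `class_label`) and the stopping check are assumptions, not the paper's actual code:

```python
import sqlite3

# In-memory table standing in for the house dataset.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, price INTEGER, sqft INTEGER)")
cur.executemany("INSERT INTO houses (price, sqft) VALUES (?, ?)",
                [(260000, 1200), (240000, 800), (550000, 2400)])

# Store cluster membership by adding a class-label column to the data table.
cur.execute("ALTER TABLE houses ADD COLUMN class_label INTEGER")

# Assign labels; here a trivial rule stands in for the clustering step.
cur.execute("UPDATE houses SET class_label = "
            "CASE WHEN price < 250000 THEN 0 ELSE 1 END")

labels = [row[0] for row in cur.execute("SELECT class_label FROM houses ORDER BY id")]

# Stopping threshold lambda = 0.002: stop partitioning a node once the
# improvement from a further split falls below lambda (illustrative check).
LAMBDA = 0.002
def should_stop(cost_reduction):
    return cost_reduction < LAMBDA
```

Keeping the labels in the data table itself means cluster lookups need no extra join, at the cost of rewriting the column whenever the clustering changes.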
Comparison: We compare our tree-construction algorithm (henceforth referred to as the Cost-based
algorithm) with the algorithm proposed by Chakrabarti et al. (Chakrabarti, Chaudhuri & Hwang, 2004)
(henceforth referred to as the Greedy algorithm). The Greedy algorithm differs from ours in two respects:
(i) it does not consider different user preferences, and (ii) it does not consider the cost of intermediate
nodes generated by future partitions. We also compare against the algorithm proposed by Chen and Li
(Chen & Li, 2007) (henceforth referred to as the C4.5-Categorization algorithm), which first merges
queries to generate data clusters and corresponding labels, and then uses a modified C4.5 to create the
navigational tree. It differs from ours in two respects: (i) it must execute queries on the dataset to
evaluate query similarity before merging similar queries, and (ii) it cannot expand intermediate nodes
to show tuples and thus does not consider the cost of visiting tuples at intermediate nodes.
Setup of user study: We conducted an empirical study by asking 5 subjects (with no overlap with
the 40 users who submitted the query history) to use this interface. The subjects were randomly selected
colleagues, students, etc. Each subject was first given a tutorial on how to use the interface. Next, each
subject was given the results of the 5 queries listed in Table 2, none of which appear in the query history.
For each such query, the subject was asked to navigate the trees generated by the three algorithms
mentioned above and to select 5-10 houses that he or she would like to buy.
Categorization Cost Experiment
This experiment compares the cost of the three categorization algorithms to show the efficiency
of our categorization approach. The actual categorization cost is defined as follows: