Table 2. Test queries used in user study

  Query                                                         Result size
  Q1: Price between 250000 and 300000 and SqFt > 1000                   323
  Q2: Schooldistrict = Tukwila and View = GreenBelt                     894
  Q3: Price < 250000 and Neighborhood in {Winslow, Shoreline}           452
  Q4: Schooldistrict = Seattle and View in {Water, GreenBelt}           558
  Q5: SqFt between 600 and 1000                                       16213
Query history: In our experiments, we asked 40 subjects to act as different kinds of house buyers
(e.g., wealthy buyers, clerks, workers, women, young couples) and to pose queries against the database.
We collected 2000 queries for the database, and these queries are used as the query history. Each
subject was asked to submit 15 queries for HouseDB; each query had 2-6 conditions and specified 4.2
attributes on average. We assume all queries have equal weight. We observed that users typically
started with a general query that returned many answers and then gradually refined it until it
returned a small number of answers.
Algorithm: We implemented all algorithms in C# and connected to the RDBMS through ADO. Clusters
are stored by adding a column to the data table that holds the class label of each tuple. The stopping
threshold λ in the tree-building algorithm is set to 0.002. We developed an interface that allows users
to navigate query results using the generated trees.
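The storage scheme described above can be sketched as follows. This is a minimal illustration using SQLite rather than the authors' C#/ADO setup, and the schema (`houses`, `class_label`) and the stopping check are assumptions, not the paper's actual code:

```python
import sqlite3

# In-memory table standing in for the house dataset.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE houses (id INTEGER PRIMARY KEY, price INTEGER, sqft INTEGER)")
cur.executemany("INSERT INTO houses (price, sqft) VALUES (?, ?)",
                [(260000, 1200), (240000, 800), (550000, 2400)])

# Store cluster membership by adding a class-label column to the data table.
cur.execute("ALTER TABLE houses ADD COLUMN class_label INTEGER")

# Assign labels; here a trivial rule stands in for the clustering step.
cur.execute("UPDATE houses SET class_label = "
            "CASE WHEN price < 250000 THEN 0 ELSE 1 END")

labels = [row[0] for row in cur.execute("SELECT class_label FROM houses ORDER BY id")]

# Stopping threshold lambda = 0.002: stop partitioning a node once the
# improvement from a further split falls below lambda (illustrative check).
LAMBDA = 0.002
def should_stop(cost_reduction):
    return cost_reduction < LAMBDA
```

Keeping the labels in the data table itself means cluster lookups need no extra join, at the cost of rewriting the column whenever the clustering changes.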
Comparison: We compare our tree-construction algorithm (henceforth referred to as the Cost-based
algorithm) with the algorithm proposed by Chakrabarti et al. (Chakrabarti, Chaudhuri & Hwang, 2004)
(henceforth referred to as the Greedy algorithm). The Greedy algorithm differs from ours in two respects:
(i) it does not consider different user preferences, and (ii) it does not consider the cost of intermediate
nodes generated by future partitions. We also compare against the algorithm proposed by Chen and Li
(Chen & Li, 2007) (henceforth referred to as the C4.5-Categorization algorithm), which first merges
queries to generate data clusters and corresponding labels, and then uses a modified C4.5 to create the
navigational tree. It differs from ours in two respects: (i) it must execute queries on the dataset to
evaluate query similarity before merging similar queries, and (ii) it cannot expand intermediate nodes
to show tuples and thus does not consider the cost of visiting tuples at intermediate nodes.
Setup of user study: We conducted an empirical study by asking 5 subjects (with no overlap with
the 40 users who submitted the query history) to use this interface. The subjects were randomly selected
colleagues, students, etc. Each subject was first given a tutorial on how to use the interface. Next, each
subject was given the results of the 5 queries listed in Table 2, none of which appear in the query history.
For each such query, the subject was asked to navigate the trees generated by the three algorithms
mentioned above and to select 5-10 houses that he or she would like to buy.
Categorization Cost Experiment
This experiment compares the cost of the three categorization algorithms to show the efficiency
of our categorization approach. The actual categorization cost is defined as follows: