Automatic Categorization of Web Database Query Results - Advanced Database Query Systems

Figure 5. Average category cost per selected house

Table 3. Results of survey

Categorization algorithm    #subjects that called it best
Cost-based                  16
C4.5-categorization         4
Greedy                      2

have also found more houses worth considering for purchase using our algorithm than with the other two
algorithms, suggesting that our method makes it easier for users to find interesting houses. The tree
generated by the Greedy algorithm has the worst results. This is expected because the Greedy algorithm
ignores differences in user preferences and does not consider future partitions when generating category
trees. The C4.5-Categorization algorithm also has a higher cost than our method. The reason is that our
algorithm uses a partitioning criterion that considers the cost of visiting the tuples in intermediate
nodes, while the C4.5-Categorization algorithm does not. Moreover, our algorithm can use a few clusters
to represent a large number of tuples without losing accuracy (this is tested in the next experiment).

The results show that, using our approach, a subject needs on average to visit no more than 8 tuples or
intermediate nodes for queries Q1, Q2, Q3, and Q4 to find the first relevant tuple, and about 18 tuples
or intermediate nodes for Q5. The total navigational cost for our algorithm is less than 45 for the
former four queries and less than 80 for Q5. At the end of the study, we asked subjects which
categorization algorithm worked best for them across all the queries they tried. The result of that
survey is reported in Table 3 and shows that a majority of subjects (16 of the 22 responses) considered
our algorithm the best.
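To make the idea of counting visited tuples and intermediate nodes concrete, the following is a minimal sketch of how such a count could be tallied on a toy category tree. It is not the cost model used in the study: the Node class, the function visits_until_first_relevant, the explores and is_relevant callbacks, and the sample houses are all hypothetical names and data introduced only for illustration. The sketch assumes that reading a category label costs one visit and that examining a tuple costs one visit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical category-tree node: either subcategories or tuples."""
    label: str
    children: List["Node"] = field(default_factory=list)
    tuples: List[dict] = field(default_factory=list)


def visits_until_first_relevant(
    node: Node,
    explores: Callable[[Node], bool],
    is_relevant: Callable[[dict], bool],
) -> Tuple[int, bool]:
    """Count labels and tuples examined until the first relevant tuple.

    Assumed toy model: each child label read costs 1, each tuple read costs 1.
    Returns (visits, found).
    """
    visits = 0
    if not node.children:
        # Leaf category: scan its tuples in order.
        for t in node.tuples:
            visits += 1
            if is_relevant(t):
                return visits, True
        return visits, False
    for child in node.children:
        visits += 1  # the user reads this intermediate node's label
        if explores(child):
            sub_visits, found = visits_until_first_relevant(
                child, explores, is_relevant
            )
            visits += sub_visits
            if found:
                return visits, True
    return visits, False


# Illustrative data only; these houses are made up for the example.
root = Node(
    label="All houses",
    children=[
        Node("Price >= 300K", tuples=[{"price": 350, "beds": 4}]),
        Node(
            "Price < 300K",
            tuples=[{"price": 150, "beds": 2}, {"price": 180, "beds": 3}],
        ),
    ],
)

# The simulated user only opens the cheaper category and wants 3+ bedrooms.
result = visits_until_first_relevant(
    root,
    explores=lambda n: n.label == "Price < 300K",
    is_relevant=lambda t: t["beds"] >= 3,
)
print(result)
```

In this toy run the user reads two category labels and then two tuples before reaching a relevant house, so the count is 4. A tree that places relevant tuples behind fewer labels and shorter tuple lists yields a lower count under these assumptions.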

Queries Clustering Experiment

This experiment tests the quality of the query clustering algorithm, whose accuracy has a great impact
on the accuracy of the tuple clusters. We first translated each query in the query history into its
corresponding vector representation, and then adopted the following strategies to generate synthetic
datasets. Every dataset is characterized by 4 parameters: n, m, l, and noise. Here the n
