Automatic Categorization of Web Database Query Results - Advanced Database Query Systems

Database Reference

In-Depth Information

Data Clustering

We generate preference-based clusters as follows. We first define a binary relationship R over tuples such

that ( r i , r j ) ∧ R if and only if two tuples r i and r j appear in the results of the exactly same set of queries

in H . If ( r i , r j ) ∧ R , according to the query history, r i and r j are not distinguishable because each user that

requests r i also requests r j and vice versa. Clearly, R is reflexive, symmetric, and transitive. Thus R is an

equivalence relation and it partitions D into equivalence classes { C 1 ,..., C q }, where tuples equivalent to

each other are put into the same class. Those tuples not selected by any query will also form a cluster

associated with zero probability (since no users are interested in them). Thus, we can define the data

clustering problem as follows.

Problem 1. Given database D , query history H , find a set of disjoint clusters C = { C 1 ,…, C q } such that for

any tuples r i and r j ∧ C l , 1≤ l ≤ q , ( r i , r j ) ∧ R , and for any tuples r i and r j not in the same cluster, ( r i , r j ) ∧∧ R .

Since the query history H may contain many queries, thus we need to cluster the queries in H , and

then to cluster the tuples depending on the clusters of query history. We will propose the algorithm for

query history and data clustering in the next section.

Category Tree Construction

We next define the problem of category tree construction.

Problem 2 . Given D , C , Q , find a tree T ( V , E , L ) such that (i) it contains all tuples in the results of Q ,

and (ii) there does not exist another tree T ' satisfying (i) and with Cost ( T ', C ) < Cost ( T , C ).

The above problem can be proved to be NP-hard in a way similar to proving that the problem of

finding an optimal decision tree with a certain average length is NP-hard. Section 5 will present an ap-

proximate solution. The category tree construction algorithm is shown in Algorithm 1.

CLUSTERING FOR QUERY HISTORY AND DATA TUPLES

This section describes the algorithm to cluster tuples using query history. We propose the preprocessing

steps to refine the query history, which include prune unimportant queries and cluster the queries. Based

on different kinds of user preferences, we propose the method of generating tuples clusters.

Algorithm 1. The category tree construction algorithm

Input : query history H , database D , query Q

Output : a category tree T

1. (Offline step) Cluster tuples in query results D using H . The results are a set

of clusters C 1 ,..., C q , and each cluster C j , 1 ≤ j ≤ q , is assigned a probability P j .

2. (Online step) For the query Q , create a category tree T with minimal

LCost( T , C ) over tuples in results of Q , using C 1 ,..., C q as class labels.

Search WWH ::

Custom Search

Home