Database Reference
In-Depth Information
Data Clustering
We generate preference-based clusters as follows. We first define a binary relationship R over tuples such
that ( r i , r j ) R if and only if two tuples r i and r j appear in the results of the exactly same set of queries
in H . If ( r i , r j ) R , according to the query history, r i and r j are not distinguishable because each user that
requests r i also requests r j and vice versa. Clearly, R is reflexive, symmetric, and transitive. Thus R is an
equivalence relation and it partitions D into equivalence classes { C 1 ,..., C q }, where tuples equivalent to
each other are put into the same class. Those tuples not selected by any query will also form a cluster
associated with zero probability (since no users are interested in them). Thus, we can define the data
clustering problem as follows.
Problem 1. Given database D , query history H , find a set of disjoint clusters C = { C 1 ,…, C q } such that for
any tuples r i and r j C l , 1≤ l q , ( r i , r j ) R , and for any tuples r i and r j not in the same cluster, ( r i , r j ) ∧∧ R .
Since the query history H may contain many queries, thus we need to cluster the queries in H , and
then to cluster the tuples depending on the clusters of query history. We will propose the algorithm for
query history and data clustering in the next section.
Category Tree Construction
We next define the problem of category tree construction.
Problem 2 . Given D , C , Q , find a tree T ( V , E , L ) such that (i) it contains all tuples in the results of Q ,
and (ii) there does not exist another tree T ' satisfying (i) and with Cost ( T ', C ) < Cost ( T , C ).
The above problem can be proved to be NP-hard in a way similar to proving that the problem of
finding an optimal decision tree with a certain average length is NP-hard. Section 5 will present an ap-
proximate solution. The category tree construction algorithm is shown in Algorithm 1.
CLUSTERING FOR QUERY HISTORY AND DATA TUPLES
This section describes the algorithm to cluster tuples using query history. We propose the preprocessing
steps to refine the query history, which include prune unimportant queries and cluster the queries. Based
on different kinds of user preferences, we propose the method of generating tuples clusters.
Algorithm 1. The category tree construction algorithm
Input : query history H , database D , query Q
Output : a category tree T
1. (Offline step) Cluster tuples in query results D using H . The results are a set
of clusters C 1 ,..., C q , and each cluster C j , 1 ≤ j q , is assigned a probability P j .
2. (Online step) For the query Q , create a category tree T with minimal
LCost( T , C ) over tuples in results of Q , using C 1 ,..., C q as class labels.
Search WWH ::




Custom Search