QCi, where Qi is a representative query (i.e., it is the center of cluster QCi), and Qi corresponds to the query cluster QCi. The time complexity of the processing part is O(|H|k), so the algorithm runs in polynomial time (Meng & Ma, 2008).
Generation of Data Clusters
After query pruning and clustering, we obtain a set of query clusters QC1, …, QCk. For each tuple ti, we generate a set Si consisting of the query clusters in which at least one query returns ti. That is, Si = { QCp | ∃ Qj ∈ QCp such that ti is returned by Qj }. We then group tuples by their Si, and each group forms a cluster. Each cluster is assigned a class label. The probability of users being interested in cluster Ci is computed as the sum of the probabilities that a user asks a query in Si; this equals the sum of the frequencies of the queries in Si divided by the sum of the frequencies of all queries in the pruned query history H.
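This grouping step can be sketched in a few lines; the toy data and all identifiers below (query_results, query_clusters, signature) are illustrative and not from the text:

```python
# Sketch: compute the signature S_i for each tuple and group tuples with
# identical signatures into data clusters. Toy data, not from the text.
from collections import defaultdict

query_results = {            # Q_j -> ids of the tuples that Q_j returns
    "Q1": {1, 2, 3},
    "Q2": {1, 2, 4},
    "Q3": {5},
}
query_clusters = {"QC1": {"Q1", "Q2"}, "QC2": {"Q3"}}

def signature(tuple_id):
    """S_i: the query clusters containing some query that returns tuple i."""
    return frozenset(
        qc for qc, queries in query_clusters.items()
        if any(tuple_id in query_results[q] for q in queries)
    )

# Group tuples by identical S_i; each group is one data cluster.
groups = defaultdict(set)
for t in {1, 2, 3, 4, 5}:
    groups[signature(t)].add(t)
```

Here tuples 1–4 share the signature {QC1} (each is returned by Q1 or Q2) and form one cluster, while tuple 5 forms its own cluster with signature {QC2}.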
Example 2. Suppose that there are four queries Q1, Q2, Q3, and Q4 and 15 tuples r1, r2, …, r15. Q1 returns the first 10 tuples r1, r2, …, r10; Q2 returns the first 9 tuples r1, r2, …, r9; Q3 returns r11, r12, and r14; and Q4 returns r15. Obviously, the first 9 tuples r1, r2, …, r9 are equivalent to each other since they are returned by both Q1 and Q2. The data can be divided into five clusters: {r1, r2, …, r9} (returned by Q1 and Q2), {r10} (returned by Q1 only), {r11, r12, r14} (returned by Q3), {r15} (returned by Q4), and {r13} (not returned by any query).
In Example 2, query clustering yields three query clusters {Q1, Q2}, {Q3}, and {Q4}, so four data clusters C1, C2, C3, and C4 will be generated. The cluster C1 corresponds to Q1 and Q2 and contains the first 10 tuples, with probability P1 = 2/4 = 0.5. The cluster C2 corresponds to Q3 and contains r11, r12, and r14, with probability P2 = 1/4 = 0.25. The cluster C3 corresponds to Q4 and contains r15, with probability P3 = 1/4 = 0.25. The cluster C4 contains r13, with probability 0, because r13 is not returned by any query. The data cluster generation algorithm is shown in Algorithm 4.
Algorithm 4. The algorithm for generating the preference-based data clusters
Input: pruned query history H, query results R
Output: data clusters C1, …, Cq of the query results
1. QueryCluster(H, U) → {QC1, …, QCk}.
2. For each tuple ri ∈ R, identify Si = { QCp | ∃ Qj ∈ QCp such that ri is returned by Qj }, and group the tuples in R by Si.
3. Output each group as a cluster Cj, assign a class label to each cluster, and compute its probability

   Pj = ( Σ_{Qi ∈ Sj} Fi ) / ( Σ_{Qp ∈ H} Fp ),

where Fi denotes the frequency of query Qi.
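Algorithm 4 can be sketched end to end on the data of Example 2. All identifiers below are illustrative, and every query is assumed to have frequency 1, so each probability Pj reduces to a count of queries divided by |H| = 4:

```python
# Sketch of Algorithm 4 on Example 2's data. Names are illustrative.
from collections import defaultdict

def data_clusters(query_clusters, query_results, all_tuples, freq):
    """Group tuples by their signature S_i and attach Pj to each cluster."""
    total = sum(freq.values())        # sum of frequencies over all of H
    groups = defaultdict(set)
    for t in all_tuples:
        s = frozenset(                # S_i: clusters with a query returning t
            qc for qc, qs in query_clusters.items()
            if any(t in query_results[q] for q in qs)
        )
        groups[s].add(t)
    # Pj = (sum of frequencies of queries in Sj) / (sum over all of H)
    return {
        s: (ts, sum(freq[q] for qc in s for q in query_clusters[qc]) / total)
        for s, ts in groups.items()
    }

# Example 2: Q1 -> r1..r10, Q2 -> r1..r9, Q3 -> {r11, r12, r14}, Q4 -> {r15}
query_results = {
    "Q1": set(range(1, 11)),
    "Q2": set(range(1, 10)),
    "Q3": {11, 12, 14},
    "Q4": {15},
}
clusters = data_clusters(
    {"QC1": {"Q1", "Q2"}, "QC2": {"Q3"}, "QC3": {"Q4"}},
    query_results,
    set(range(1, 16)),
    {q: 1 for q in query_results},   # assumed uniform frequencies
)
# Reproduces C1 = {r1..r10} with P1 = 0.5, C2 = {r11, r12, r14} with
# P2 = 0.25, C3 = {r15} with P3 = 0.25, and C4 = {r13} with P4 = 0.
```

Note that r13 receives the empty signature, so its cluster's probability is an empty sum, i.e., 0, matching the text.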