QC_i, where Q_i is a representative query (i.e., it is the center of the cluster QC_i), and Q_i corresponds to the query cluster QC_i. The time complexity of the processing part is O(|H|k), and thus the algorithm is polynomially solvable (Meng & Ma, 2008).
Generation of Data Clusters
After query pruning and clustering, we get a set of query clusters QC_1, …, QC_k. For each tuple t_i, we generate a set S_i consisting of the query clusters in which at least one query returns t_i. That is, S_i = {QC_p | ∃Q_j ∈ QC_p such that t_i is returned by Q_j}. We then group tuples according to their S_i, and each group forms a cluster. Each cluster is assigned a class label. The probability of users being interested in a cluster C_i is computed as the sum of the probabilities that a user asks a query in S_i. This equals the sum of the frequencies of the queries in S_i divided by the sum of the frequencies of all queries in the pruned query history H.
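The grouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the dict-based representation of query clusters and query results are assumptions made for the example.

```python
from collections import defaultdict

def group_by_query_clusters(all_tuples, query_clusters, results):
    """Group tuples by the set S_i of query clusters that return them.

    all_tuples:     iterable of tuple ids
    query_clusters: dict mapping cluster id -> set of query ids
    results:        dict mapping query id -> set of tuple ids it returns
    """
    groups = defaultdict(list)
    for t in all_tuples:
        # S_i = {QC_p | some Q_j in QC_p returns t}
        s_i = frozenset(
            qc for qc, queries in query_clusters.items()
            if any(t in results[q] for q in queries)
        )
        groups[s_i].append(t)
    # Each distinct S_i becomes one data cluster.
    return dict(groups)
```

Tuples sharing the same frozenset key end up in the same data cluster; tuples returned by no query get the empty frozenset as their key.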
Example 2. Suppose that there are four queries Q_1, Q_2, Q_3, and Q_4 and 15 tuples r_1, r_2, …, r_15. Q_1 returns the first 10 tuples r_1, r_2, …, r_10; Q_2 returns the first 9 tuples r_1, r_2, …, r_9; Q_3 returns r_11, r_12, and r_14; and Q_4 returns r_15. Obviously, the first 9 tuples r_1, r_2, …, r_9 are equivalent to each other since they are returned by both Q_1 and Q_2. The data can be divided into five clusters: {r_1, r_2, …, r_9} (returned by Q_1 and Q_2), {r_10} (returned by Q_1 only), {r_11, r_12, r_14} (returned by Q_3), {r_15} (returned by Q_4), and {r_13} (not returned by any query).
In Example 2, after query clustering we get three query clusters {Q_1, Q_2}, {Q_3}, and {Q_4}. Four data clusters C_1, C_2, C_3, and C_4 will be generated. The cluster C_1 corresponds to Q_1 and Q_2 and contains the first 10 tuples, with probability P_1 = 2/4 = 0.5. The cluster C_2 corresponds to Q_3 and contains r_11, r_12, and r_14, with probability P_2 = 1/4 = 0.25. The cluster C_3 corresponds to Q_4 and contains r_15, with probability P_3 = 1/4 = 0.25. The cluster C_4 contains r_13, with probability 0, because r_13 is not returned by any query. The data cluster generation algorithm is shown in Algorithm 4.
Algorithm 4. The algorithm for generating the preference-based data clusters
Input: pruned query history H, query results R
Output: data clusters of query results C_1, …, C_q
1. QueryCluster(H, U) → {QC_1, …, QC_k}.
2. For each tuple r_i ∈ R, identify S_i = {QC_p | ∃Q_j ∈ QC_p such that r_i is returned by Q_j}. Group tuples in R by S_i.
3. Output each group as a cluster C_j, assign a class label to each cluster, and compute the probability P_j = (∑_{Q_i ∈ S_j} F_i) / (∑_{Q_i ∈ H_p} F_i).
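Steps 2 and 3 of Algorithm 4 can be sketched together as below. This assumes step 1 (QueryCluster) has already produced the query clusters; the function name, the dict-based inputs, and the uniform frequencies used in the check are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def generate_data_clusters(query_clusters, results, frequency, all_tuples):
    """Sketch of steps 2-3 of Algorithm 4.

    query_clusters: dict cluster id -> set of query ids (output of step 1)
    results:        dict query id -> set of tuples returned
    frequency:      dict query id -> frequency F_i in the pruned history
    all_tuples:     every tuple in the query results R
    """
    # Step 2: compute S_i for each tuple and group tuples by it.
    groups = defaultdict(set)
    for t in all_tuples:
        s_i = frozenset(qc for qc, qs in query_clusters.items()
                        if any(t in results[q] for q in qs))
        groups[s_i].add(t)

    # Step 3: label each group and compute P_j as the frequency mass of
    # the queries in S_j over the total mass of the pruned history.
    total = sum(frequency.values())
    clusters = []
    for label, (s_j, members) in enumerate(groups.items(), start=1):
        mass = sum(frequency[q] for qc in s_j for q in query_clusters[qc])
        clusters.append((f"C{label}", members, mass / total if total else 0.0))
    return clusters
```

On the data of Example 2 with all query frequencies equal to 1, this yields four clusters with probabilities 0.5, 0.25, 0.25, and 0, matching the walkthrough above.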