QC_i, where Q_i is a representative query (i.e., it is the center of the cluster QC_i), and Q_i corresponds to the query cluster QC_i. The time complexity of the processing part is O(|H|k), and thus the algorithm is polynomially solvable (Meng & Ma, 2008).
Generation of Data Clusters
After query pruning and clustering, we get a set of query clusters QC_1, …, QC_k. For each tuple t_i, we generate a set S_i consisting of the query clusters in which at least one query returns t_i. That is, S_i = {QC_p | ∃Q_j ∈ QC_p such that t_i is returned by Q_j}. We then group tuples according to their S_i, and each group forms a cluster. Each cluster is assigned a class label. The probability of users being interested in a cluster C_i is computed as the sum of the probabilities that a user asks a query in S_i. This equals the sum of the frequencies of the queries in S_i divided by the sum of the frequencies of all queries in the pruned query history H.
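The grouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the dict-based representation of query clusters and query results are assumptions made for the example.

```python
from collections import defaultdict

def group_by_query_clusters(all_tuples, query_clusters, results):
    """Group tuples by the set S_i of query clusters that return them.

    all_tuples:     iterable of tuple ids
    query_clusters: dict mapping cluster id -> set of query ids
    results:        dict mapping query id -> set of tuple ids it returns
    """
    groups = defaultdict(list)
    for t in all_tuples:
        # S_i = {QC_p | some Q_j in QC_p returns t}
        s_i = frozenset(
            qc for qc, queries in query_clusters.items()
            if any(t in results[q] for q in queries)
        )
        groups[s_i].append(t)
    # Each distinct S_i becomes one data cluster.
    return dict(groups)
```

Tuples sharing the same frozenset key end up in the same data cluster; tuples returned by no query get the empty frozenset as their key.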
Example 2. Suppose that there are four queries Q_1, Q_2, Q_3, and Q_4 and 15 tuples r_1, r_2, …, r_15. Q_1 returns the first 10 tuples r_1, r_2, …, r_10; Q_2 returns the first 9 tuples r_1, r_2, …, r_9; Q_3 returns r_11, r_12, and r_14; and Q_4 returns r_15. Obviously, the first 9 tuples r_1, r_2, …, r_9 are equivalent to each other since they are returned by both Q_1 and Q_2. The data can be divided into five clusters: {r_1, r_2, …, r_9} (returned by Q_1 and Q_2), {r_10} (returned by Q_1 only), {r_11, r_12, r_14} (returned by Q_3), {r_15} (returned by Q_4), and {r_13} (not returned by any query).
In Example 2, after query clustering we get three query clusters {Q_1, Q_2}, {Q_3}, and {Q_4}. Four data clusters C_1, C_2, C_3, and C_4 will be generated. The cluster C_1 corresponds to Q_1 and Q_2 and contains the first 10 tuples, with probability P_1 = 2/4 = 0.5. The cluster C_2 corresponds to Q_3 and contains r_11, r_12, and r_14, with probability P_2 = 1/4 = 0.25. The cluster C_3 corresponds to Q_4 and contains r_15, with probability P_3 = 1/4 = 0.25. The cluster C_4 contains r_13, with probability 0, because r_13 is not returned by any query. The data cluster generation algorithm is shown in Algorithm 4.
Algorithm 4. The algorithm for generating the preference-based data clusters
Input: pruned query history H, query results R
Output: data clusters of query results C_1, …, C_q
1. QueryCluster(H, U) → {QC_1, …, QC_k}.
2. For each tuple r_i ∈ R, identify S_i = {QC_p | ∃Q_j ∈ QC_p such that r_i is returned by Q_j}. Group tuples in R by S_i.
3. Output each group as a cluster C_j, assign a class label to each cluster, and compute the probability P_j = (∑_{Q_i ∈ S_j} F_i) / (∑_{Q_i ∈ H_p} F_i).
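Steps 2 and 3 of Algorithm 4 can be sketched together as below. This assumes step 1 (QueryCluster) has already produced the query clusters; the function name, the dict-based inputs, and the uniform frequencies used in the check are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def generate_data_clusters(query_clusters, results, frequency, all_tuples):
    """Sketch of steps 2-3 of Algorithm 4.

    query_clusters: dict cluster id -> set of query ids (output of step 1)
    results:        dict query id -> set of tuples returned
    frequency:      dict query id -> frequency F_i in the pruned history
    all_tuples:     every tuple in the query results R
    """
    # Step 2: compute S_i for each tuple and group tuples by it.
    groups = defaultdict(set)
    for t in all_tuples:
        s_i = frozenset(qc for qc, qs in query_clusters.items()
                        if any(t in results[q] for q in qs))
        groups[s_i].add(t)

    # Step 3: label each group and compute P_j as the frequency mass of
    # the queries in S_j over the total mass of the pruned history.
    total = sum(frequency.values())
    clusters = []
    for label, (s_j, members) in enumerate(groups.items(), start=1):
        mass = sum(frequency[q] for qc in s_j for q in query_clusters[qc])
        clusters.append((f"C{label}", members, mass / total if total else 0.0))
    return clusters
```

On the data of Example 2 with all query frequencies equal to 1, this yields four clusters with probabilities 0.5, 0.25, 0.25, and 0, matching the walkthrough above.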