Database Reference
In-Depth Information
Now, we can define the similarity between
Q
1
and
Q
2
using their vector representations
V
Q
1
and
V
Q
2
as follows:
V V
V
⋅
Q
1
Q
2
Sim(
Q Q
,
)
=
cos(
V V
,
)
=
(2)
1
2
Q
1
Q
2
|
||
V
|
Q
1
Q
2
In order to quantify how well a query
Q
1
is represented by another query
Q
2
, we need to define a
distance measure between two queries. Based on the similarity mentioned above, the distance between
Q
1
and
Q
2
can be defined as
d
(
Q
1
,
Q
2
) = 1- Sim(
Q
1
,
Q
2
)
(3)
Based on the definitions above, the queries clustering problem can be defined. Let
H
be the s
et
of
m
queries in query history:
H
= {
Q
1
,…,
Q
m
}. The we need to find a set of
k
queries
H
k
= {
Q
1
,…,
Q
k
} (
k
<
m
) such that:
cos (
t H
)
=
∈
d Q QC
( ,
)
(4)
k
k
Q H
is minimized. The distance of a query
Q
i
from a set of queries
H
is defined as
d Q H
(
,
) min (
=
d Q Q
,
)
(5)
i
i
j
Q H
∈
j
We call the queries in set
H
k
representative queries and associate with each representative query
Q
i
a set of queries
QC
=
{
Q Q
|
=
argmin
d Q Q
(
,
)}
.
j
j
'
j
i
j
'
i
Complexity of the Queries Clustering Problem
The problem of queries clustering is the same as the
k
-median problem. The
k
-median problem is well
known to be NP-hard. An instance of the metric
k
-median problem consists of a metric space χ = (
X
, c),
where
X
is a set of points and c is a distance function (also called the cost) that specifies the distance c
xy
≥ 0 between any pair of nodes
x
,
y
∧
X
. The distance function is reflexive, symmetric, and satisfies the
triangle inequality. Given a set of points
F
∧
X
, the cost of
F
is defined by
cost
∑
( )
F
c
,
=
where
x X xF
∈
min
for
x
∧
X
. The objective is to find a
k
-element set
F
∧
X
that minimizes cost(
F
)
(Chrobak, Keynon & Young, 2005). Obviously, the queries clustering problem can be treated as the
k
-
median problem and it is also NP-hard. Thus, we have to think of approximation algorithms for solving
it.
c
=
c
xF
f F
∈
xF