In [1], it is investigated how to diversify search results by making explicit use of knowledge about the topics that the query or the documents may refer to. Specifically, it is assumed that there exists a taxonomy of information, and the diversity in user intents is modeled at the topical level of the taxonomy. On this basis, an objective is proposed to explicitly trade off relevance and diversity as follows:

$$P(S|q) = \sum_{c} P(c|q)\left(1 - \prod_{x \in S}\bigl(1 - V(x|q,c)\bigr)\right), \quad (6.13)$$
where V(x|q,c) denotes the relevance probability of a document x for query q when the intended category is c.
The above objective has a straightforward probabilistic interpretation: it is exactly the probability that at least one of the documents in the set S satisfies the average user (in terms of expectation over intents) who issues query q.
It has been proven that the general problem of optimizing the above objective is NP-hard. The good news is that the objective function admits a submodularity structure [12] that can be exploited to implement a good approximation algorithm. Intuitively, a submodular function satisfies the economic principle of diminishing marginal returns, i.e., the marginal benefit of adding a document to a larger collection is less than that of adding it to a smaller collection. With this property, a greedy algorithm can be designed.
Given the top k documents selected by some classical ranking algorithm for the target query, the greedy algorithm reorders these documents to maximize the objective P(S|q). Let U(c|q,S) denote the conditional probability that query q belongs to category c, given that all documents in set S fail to satisfy the user. Initially, before any document is selected, U(c|q,∅) = P(c|q). The algorithm selects output documents one at a time. At every step, it chooses the document with the highest marginal utility, defined as below:

$$g(x|q,c,S) = U(c|q,S)\,V(x|q,c). \quad (6.14)$$
This marginal utility can be interpreted as the probability that the selected document satisfies the user, given that all documents coming before it fail to do so. At the end of each iteration, the conditional distribution U(c|q,S) is updated using Bayes' rule to reflect the inclusion of the new document in the result set.
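The greedy loop described above can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the authors' exact implementation: the marginal utility is aggregated over categories as Σ_c U(c|q,S)·V(x|q,c), and the multiplicative update U(c) ← U(c)·(1 − V(x|q,c)) is assumed as the concrete form of the Bayes-rule step (category c stays unsatisfied only if the newly selected document fails for it).

```python
# Sketch of the greedy reordering loop described above, on made-up data.
P_c = {"sports": 0.6, "music": 0.4}                # P(c|q)
V = {("d1", "sports"): 0.9, ("d1", "music"): 0.1,
     ("d2", "sports"): 0.8, ("d2", "music"): 0.2,
     ("d3", "sports"): 0.1, ("d3", "music"): 0.7}  # V(x|q,c)

def greedy_rerank(docs, k):
    U = dict(P_c)                      # U(c|q, empty set) = P(c|q)
    selected, remaining = [], list(docs)
    for _ in range(min(k, len(remaining))):
        # Choose the document with the highest marginal utility,
        # summed over categories: sum_c U(c|q,S) * V(x|q,c).
        best = max(remaining,
                   key=lambda x: sum(U[c] * V[(x, c)] for c in U))
        selected.append(best)
        remaining.remove(best)
        # Update: intent c remains unsatisfied only if 'best' fails for it.
        for c in U:
            U[c] *= 1.0 - V[(best, c)]
    return selected

print(greedy_rerank(["d1", "d2", "d3"], 2))  # ['d1', 'd3']
```

Note how the update suppresses the already-covered "sports" intent after d1 is chosen, so the second pick is the music-oriented d3 rather than the individually more relevant d2.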
For this algorithm, a condition has been derived under which the greedy heuristic leads to the optimal diversification result. Further analysis shows that even if the condition is not met, the approximation error of the greedy method is well bounded.
Inspired by the aforementioned algorithm, the authors of [1] further propose refining existing evaluation measures in information retrieval into intent-aware measures. For example, NDCG and MAP become

$$\mathrm{NDCG\text{-}IA}@k = \sum_{c} P(c|q)\,\mathrm{NDCG}@k, \quad (6.15)$$

where the NDCG@k inside the sum is computed with respect to the relevance judgments for category c.
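Computationally, Eq. (6.15) is just an intent-weighted average of per-category NDCG@k values. A minimal sketch, where both the intent distribution and the per-category scores are made-up numbers:

```python
# NDCG-IA@k per Eq. (6.15): weight per-category NDCG@k by P(c|q).
# All numbers below are hypothetical.
P_c = {"sports": 0.6, "music": 0.4}            # P(c|q)
ndcg_at_k = {"sports": 0.9, "music": 0.5}      # NDCG@k computed per category

ndcg_ia = sum(P_c[c] * ndcg_at_k[c] for c in P_c)
print(ndcg_ia)  # 0.6*0.9 + 0.4*0.5 = 0.74
```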