(denoted as the KL divergence loss for short):

L(f; x, Ω_y) = D(P_y(π) ‖ P(π | f(w, x)))    (4.23)
It has been proven in [31] that the above KL divergence loss function is convex, and therefore one can use a simple gradient descent method to get its global optimum. As shown in [4], the training curve of ListNet demonstrates well the correlation between the KL divergence loss and 1 − NDCG@5, although the loss is not explicitly related to NDCG.
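To make the definition concrete, the following is a minimal sketch (function names are illustrative, not from the original work) that evaluates the KL divergence loss by brute force, using the Plackett-Luce model with the standard exponential transformation φ(s) = exp(s). Because it enumerates all m! permutations, it is tractable only for tiny m, which is exactly the computational issue discussed below.

```python
import itertools
import numpy as np

def pl_permutation_prob(scores, perm):
    """Plackett-Luce probability of one permutation under the given scores.

    At each step, the object placed next is chosen with probability
    exp(score) / sum(exp(scores of all not-yet-placed objects)).
    """
    phi = np.exp(np.asarray(scores, dtype=float))
    prob = 1.0
    remaining = list(perm)
    while remaining:
        prob *= phi[remaining[0]] / phi[remaining].sum()
        remaining = remaining[1:]
    return prob

def kl_divergence_loss(true_scores, model_scores):
    """Brute-force KL divergence between the two Plackett-Luce
    distributions, summing over all m! permutations (cf. Eq. (4.23)).
    Exponential in m -- for illustration only."""
    m = len(true_scores)
    loss = 0.0
    for perm in itertools.permutations(range(m)):
        p = pl_permutation_prob(true_scores, perm)   # P_y(pi)
        q = pl_permutation_prob(model_scores, perm)  # P(pi | f(w, x))
        loss += p * np.log(p / q)
    return loss
```

When the model scores induce the same distribution as the ground-truth scores, the loss is zero, consistent with the convexity and global-optimum discussion above.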
As one may have noticed, there is a computational issue with the ListNet algorithm. Although, due to the use of the scoring function to define the hypothesis, the testing complexity of ListNet can be the same as that of the pointwise and pairwise approaches, the training complexity of ListNet is much higher. The training complexity is in the exponential order of m (and thus intractable in practice), because the evaluation of the KL divergence loss for each query q requires the addition of m! terms. Comparatively speaking, the training complexities of the pointwise and pairwise approaches are roughly proportional to the number of documents (i.e., O(m)) and the number of document pairs (i.e., O(m²)). To tackle the problem, a top-k version of the KL divergence loss is further introduced in [4], which is based on the top-k Plackett-Luce model and can reduce the training complexity from the exponential to the polynomial order of m.
4.3.2 ListMLE
Even if the top-k KL divergence loss is used, one still cannot avoid the following limitations of ListNet. When k is set to be large, the time complexity of evaluating the KL divergence loss is still very high. However, when k is set to be small, information about the permutation will be significantly lost and the effectiveness of the ListNet algorithm is no longer guaranteed [4].
To tackle the problem, a new algorithm, named ListMLE, is proposed [31]. ListMLE is also based on the Plackett-Luce model. For each query q, with the permutation probability distribution defined with the output of the scoring function, it uses the negative log likelihood of the ground-truth permutation as the loss function.⁵ We denote this new loss function as the likelihood loss for short:
L(f; x, π_y) = − log P(π_y | f(w, x))    (4.24)
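A minimal sketch of this likelihood loss (function names illustrative, assuming the exponential transformation φ(s) = exp(s)): the Plackett-Luce probability of the single permutation π_y factorizes into m selection steps, each a softmax over the not-yet-ranked suffix, so no enumeration over permutations is needed.

```python
import numpy as np

def likelihood_loss(scores, perm_y):
    """Negative log likelihood of the ground-truth permutation perm_y
    under the Plackett-Luce model (cf. Eq. (4.24)).

    -log prod_j exp(s_j) / sum_{t>=j} exp(s_t), computed with a
    stabilized log-sum-exp over each suffix of the ranked list.
    """
    s = np.asarray(scores, dtype=float)[list(perm_y)]  # scores in ranked order
    loss = 0.0
    for j in range(len(s)):
        suffix = s[j:]
        c = suffix.max()  # shift by the max to stabilize log-sum-exp
        loss += np.log(np.exp(suffix - c).sum()) + c - s[j]
    return loss
```

As expected, the permutation that ranks documents by descending score receives a smaller loss than its reversal.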
It is clear that in this way the training complexity has been greatly reduced as compared to ListNet, since one only needs to compute the probability of a single permutation π_y but not all the permutations. Once again, it can be proven that this
⁵ Please note that one can also use the top-k Plackett-Luce model to define the likelihood loss. However, in this case, the purpose is not to reduce the computational complexity but to better reflect the real ranking requirements.