Information Technology Reference
(denoted as the K-L divergence loss for short).
D P y (π) P π
| f(w, x ) .
x ,Ω y )
It has been proven in [ 31 ] that the above K-L divergence loss function is con-
vex, and therefore one can use a simple gradient descent method to get its global
optimum. As shown in [ 4 ], the training curve of ListNet demonstrates well the cor-
relation between the K-L divergence loss and 1
NDCG@5, although the loss is
not explicitly related to NDCG.
As one may have noticed, there is a computational issue with the ListNet algo-
rithm. Although due to the use of the scoring function to define the hypothesis, the
testing complexity of ListNet can be the same as those of the pointwise and pairwise
approaches, the training complexity of ListNet is much higher. The training com-
plexity is in the exponential order of m (and thus intractable in practice), because
the evaluation of the K-L divergence loss for each query q requires the addition of
m -factorial terms. Comparatively speaking, the training complexities of the point-
wise and pairwise approaches are roughly proportional to the number of documents
(i.e., O(m) ) and the number of document pairs (i.e., O( m) ). To tackle the problem, a
top- k version of the K-L divergence loss is further introduced in [ 4 ], which is based
on the top- k Plackett-Luce model and can reduce the training complexity from the
exponential to the polynomial order of m .
Even if the top- k K-L divergence loss is used, one still cannot avoid its following
limitations of ListNet. When k is set to be large, the time complexity of evaluating
the K-L divergence loss is still very high. However, when k is set to be small, infor-
mation about the permutation will be significantly lost and the effectiveness of the
ListNet algorithm becomes not guaranteed [ 4 ].
To tackle the problem, a new algorithm, named ListMLE, is proposed [ 31 ].
ListMLE is also based on the Plackett-Luce model. For each query q , with the
permutation probability distribution defined with the output of the scoring function,
it uses the negative log likelihood of the ground truth permutation as the loss func-
tion. 5 We denote this new loss function as the likelihood loss for short:
log P π y | f(w, x ) .
x ,π y ) =−
It is clear that in this way the training complexity has been greatly reduced as
compared to ListNet, since one only needs to compute the probability of a single
permutation π y but not all the permutations. Once again, it can be proven that this
5 Please note that one can also use the top- k Plackett-Luce model to define the likelihood loss.
However, in this case, the purpose is not to reduce the computational complexity but to better
reflect the real ranking requirements.