The Listwise Approach - Learning to Rank for Information Retrieval

Information Technology Reference

In-Depth Information

(denoted as the K-L divergence loss for short).

D P y (π) P π

| f(w, x ) .

L(f

;

x ,Ω y )

=

(4.23)

It has been proven in [ 31 ] that the above K-L divergence loss function is con-

vex, and therefore one can use a simple gradient descent method to get its global

optimum. As shown in [ 4 ], the training curve of ListNet demonstrates well the cor-

relation between the K-L divergence loss and 1

−

NDCG@5, although the loss is

not explicitly related to NDCG.

As one may have noticed, there is a computational issue with the ListNet algo-

rithm. Although due to the use of the scoring function to define the hypothesis, the

testing complexity of ListNet can be the same as those of the pointwise and pairwise

approaches, the training complexity of ListNet is much higher. The training com-

plexity is in the exponential order of m (and thus intractable in practice), because

the evaluation of the K-L divergence loss for each query q requires the addition of

m -factorial terms. Comparatively speaking, the training complexities of the point-

wise and pairwise approaches are roughly proportional to the number of documents

(i.e., O(m) ) and the number of document pairs (i.e., O( m) ). To tackle the problem, a

top- k version of the K-L divergence loss is further introduced in [ 4 ], which is based

on the top- k Plackett-Luce model and can reduce the training complexity from the

exponential to the polynomial order of m .

4.3.2 ListMLE

Even if the top- k K-L divergence loss is used, one still cannot avoid its following

limitations of ListNet. When k is set to be large, the time complexity of evaluating

the K-L divergence loss is still very high. However, when k is set to be small, infor-

mation about the permutation will be significantly lost and the effectiveness of the

ListNet algorithm becomes not guaranteed [ 4 ].

To tackle the problem, a new algorithm, named ListMLE, is proposed [ 31 ].

ListMLE is also based on the Plackett-Luce model. For each query q , with the

permutation probability distribution defined with the output of the scoring function,

it uses the negative log likelihood of the ground truth permutation as the loss func-

tion. 5 We denote this new loss function as the likelihood loss for short:

log P π y | f(w, x ) .

L(f

;

x ,π y ) =−

(4.24)

It is clear that in this way the training complexity has been greatly reduced as

compared to ListNet, since one only needs to compute the probability of a single

permutation π y but not all the permutations. Once again, it can be proven that this

5 Please note that one can also use the top- k Plackett-Luce model to define the likelihood loss.

However, in this case, the purpose is not to reduce the computational complexity but to better

reflect the real ranking requirements.

Learning to Rank for Information Retrieval

Search WWH ::

Custom Search

Home