Table 17.1 N(φ) and C_A(φ) for ListNet and ListMLE (entries of N(φ), C^ListMLE(φ), and C^ListNet(φ) for the linear, exponential, and sigmoid transformation functions φ_L, φ_E, and φ_S)
17.2.3 For Two-Layer Ranking
As can be seen, the generalization bounds introduced in the previous subsections are defined over either documents or queries, mainly because of the statistical frameworks on which they are based. As a result, they cannot explain some empirical observations. For example, the experiments in [18] show that, given a fixed total number of judgments, neither extremely deep labeling (i.e., labeling a large number of documents for each of only a few queries) nor extremely shallow labeling (i.e., labeling only a few documents for each of a large number of queries) is a good choice for achieving the best test performance. This cannot be explained by any of the above generalization bounds.
To address the aforementioned issues, in [8], generalization analysis for pairwise ranking algorithms is conducted under the two-layer ranking framework. The expected and empirical risks are defined as in (16.17) and (16.18) (see Chap. 16). In particular, the following result has been obtained in [8], also based on the concept of the Rademacher average (RA) [4, 5].
Theorem 17.6 Suppose L is the loss function for pairwise ranking. Assume (1) L ∘ F is bounded by M, and (2) E[R_m(L ∘ F)] ≤ D(L ∘ F, m). Then with probability at least 1 − δ (0 < δ < 1), ∀f ∈ F,

R(f) ≤ R̂(f) + D(L ∘ F, n) + (1/n) ∑_{i=1}^{n} D(L ∘ F, m^{(i)}) + √(2M² log(2/δ) / n) + √((2M² log(2/δ) / n²) ∑_{i=1}^{n} 1/m^{(i)}),
where, for certain ranking function classes, D(L ∘ F, m) has an explicit form. For example, for the function class F = {f(x, x′) = f(x) − f(x′); f ∈ F}, whose VC dimension is V and for which |f(x)| ≤ B, it has been proven in [4, 5] that D(L_0 ∘ F, m) = c_1 √(V/m) and D(L_φ ∘ F, m) = c_2 Bφ′(B) √(V/m), where c_1 and c_2 are both constants.
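For illustration, suppose every query has the same number m of judged documents (a simplifying assumption for this sketch). Plugging the explicit form D(L_0 ∘ F, m) = c_1 √(V/m) into the bound of Theorem 17.6 then gives

R(f) ≤ R̂(f) + c_1 √(V/n) + c_1 √(V/m) + √(2M² log(2/δ) / n) + √(2M² log(2/δ) / (nm)),

in which the slack decreases when either n or m increases, but vanishes only when both grow.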
According to the above theorem, one may have the following discussions.
• Increasing the number of either queries or documents per query in the training data will enhance the two-layer generalization ability.
• Only if n → ∞ and m^{(i)} → ∞ simultaneously does the two-layer generalization bound uniformly converge (see the numerical sketch below).
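This trade-off can also be seen numerically. The following sketch (the constants c1, M, V, δ and the budget N below are arbitrary illustrative choices, not values from [8] or [18], and all queries are assumed to have m^{(i)} = m judged documents) evaluates the slack of the above bound under a fixed total number of judgments n · m.

# Illustrative sketch: shape of the two-layer bound's slack under a fixed
# labeling budget n * m = N, using the explicit form D(L_0 ∘ F, m) = c1 * sqrt(V/m).
# The constants c1, M, V, delta, N are arbitrary choices for illustration only.
import math

c1, M, V, delta, N = 1.0, 1.0, 10, 0.05, 10_000

def two_layer_slack(n, m):
    """Slack of the two-layer bound for n queries, each with m judged documents."""
    query_term = c1 * math.sqrt(V / n)      # D(L ∘ F, n)
    doc_term = c1 * math.sqrt(V / m)        # (1/n) Σ_i D(L ∘ F, m^(i)), all m^(i) = m
    dev_query = math.sqrt(2 * M**2 * math.log(2 / delta) / n)
    dev_doc = math.sqrt(2 * M**2 * math.log(2 / delta) / (n * m))
    return query_term + doc_term + dev_query + dev_doc

for n in (10, 100, 1000, 5000):
    m = N // n
    print(f"n={n:5d} queries, m={m:4d} docs/query, slack={two_layer_slack(n, m):.3f}")

With these illustrative constants, the balanced splits yield a smaller slack than either extreme, which is consistent with the empirical observation in [18] and with the two discussion points above.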