Data Preprocessing for Learning to Rank - Learning to Rank for Information Retrieval


In this chapter, we will try to answer the above questions. In particular, we will
first introduce various models for user click behaviors and discuss how they can be
used to automatically mine ground-truth labels for learning to rank. Then we will
discuss the problem of data selection for learning to rank, which includes document
selection for labeling, and document/feature selection for training.

13.2 Ground Truth Mining from Logs

13.2.1 User Click Models

Most commercial search engines log users' click behaviors during their interaction
with the search interface. Such click logs embed important clues about user
satisfaction with a search engine and can provide a highly valuable source of
relevance information. Compared with human judgment, click information is much
cheaper to obtain and can reflect up-to-date relevance (relevance changes over
time). However, clicks are also known to be biased and noisy. Therefore, it is
necessary to develop models that remove the bias and noise in order to obtain
reliable relevance labels.

Classical click models include the position models [10, 13, 28] and the cascade
model [10]. A position model assumes that a click depends on both relevance and
examination. Each document has a certain probability of being examined, which
decays with, and depends only on, its rank position. A click on a document indicates
that the document was examined and considered relevant by the user. However, this
model treats the individual documents in a search result page independently and fails
to capture the interdependency between documents in the examination probability.
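The position model described above can be illustrated with a minimal Python sketch. It assumes the simplest factorization, in which the click probability at a rank is the product of a rank-only examination probability and the document's relevance. The function name and all numeric values are hypothetical and chosen only for illustration; they do not come from the cited models.

```python
def position_model_click_probs(relevance, examination_by_rank):
    """Click probability per rank under a simple position model.

    relevance[i]           -- probability that the document at rank i is
                              considered relevant (illustrative values).
    examination_by_rank[i] -- probability that rank i is examined; it
                              depends only on the rank, not on other documents.
    """
    if len(relevance) > len(examination_by_rank):
        raise ValueError("need an examination probability for every rank")
    # Assumed factorization: P(click) = P(examined) * P(relevant).
    return [examination_by_rank[i] * r for i, r in enumerate(relevance)]


# Hypothetical examination probabilities that decay with rank.
examination = [1.0, 0.6, 0.4]
relevance = [0.5, 0.9, 0.5]

click_probs = position_model_click_probs(relevance, examination)

# Changing the relevance of the top document leaves the click probabilities
# at lower ranks untouched, which is the independence the text criticizes.
click_probs_alt = position_model_click_probs([0.1, 0.9, 0.5], examination)
```

In this sketch the two documents with the same relevance (ranks 1 and 3) receive different click probabilities purely because of their positions, which is the position bias such a model is meant to factor out.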

The cascade model assumes that users examine the results sequentially and stop
as soon as a relevant document is clicked. Here, the probability of examination is
indirectly determined by two factors: the rank of the document and the relevance of
all previous documents. The cascade model makes the strong assumption that there
is only one click per search, and hence it cannot explain abandoned searches or
searches with multiple clicks.
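To make the dependence on previous documents concrete, the following Python sketch computes examination and click probabilities under the usual formulation of the cascade model: the user scans from the top, clicks the document at the current rank with probability equal to its relevance, and stops after the first click. The function name and the relevance values are hypothetical.

```python
def cascade_model_probs(relevance):
    """Examination and click probabilities per rank under a cascade model.

    relevance[i] -- probability that the user clicks the document at rank i
                    once it has been examined (illustrative values).
    Returns (examine, click), two lists aligned with the ranking.
    """
    examine, click = [], []
    reach = 1.0  # probability of reaching the current rank without a click
    for r in relevance:
        examine.append(reach)    # examined only if nothing above was clicked
        click.append(reach * r)  # first (and only) click happens here
        reach *= 1.0 - r         # user skips this document and moves on
    return examine, click


relevance = [0.5, 0.9, 0.5]
examine, click = cascade_model_probs(relevance)
```

With these values, the third document is examined far less often than the second, not because of its rank alone but because the highly relevant second document absorbs most of the users who reach it.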

To sum up, there are at least the following problems with the aforementioned
classical models.

• The models cannot effectively deal with multiple clicks in a session.

• The models cannot distinguish perceived relevance and actual relevance. Because
  users cannot examine the content of a document until they click on the document,
  the decision to click is made based on perceived relevance. While there is a strong
  correlation between perceived relevance and actual relevance, there are also many
  cases where they differ.

• The models cannot naturally lead to a preference probability on a pair of
  documents, while such preference information is required by many pairwise ranking
  methods.
