Data Preprocessing for Learning to Rank - Learning to Rank for Information Retrieval

• Browsing features: These features characterize users' interactions with pages beyond the search result page. For example, one can compute how long users dwell on a page or domain. Such features allow us to model intra-query diversity of page browsing behavior (e.g., navigational queries, on average, are likely to have shorter page dwell times than transactional or informational queries).

• Click-through features: Clicks are a special case of user interaction with the search engine. Click-through features used in [1] include the number of clicks for the result, whether there is a click on the result below or above the current URL, etc.
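The two feature families above can be illustrated with a minimal Python sketch. The function names, the input shapes (a ranked URL list, a set of clicked URLs, and a list of (url, seconds) visits), and the feature names are assumptions made for illustration; they are not the feature definitions used in [1].

```python
from collections import defaultdict

def click_features(ranked_urls, clicked_urls):
    """Per-URL click-through features for one query impression.

    ranked_urls: result URLs in display order (top first).
    clicked_urls: set of URLs the user clicked.
    """
    clicked = [url in clicked_urls for url in ranked_urls]
    features = {}
    for pos, url in enumerate(ranked_urls):
        features[url] = {
            "clicked": clicked[pos],
            # Was any result ranked above (earlier than) this one clicked?
            "click_above": any(clicked[:pos]),
            # Was any result ranked below (later than) this one clicked?
            "click_below": any(clicked[pos + 1:]),
        }
    return features

def mean_dwell_time(visits):
    """Average dwell time in seconds per page from (url, seconds) visits."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for url, seconds in visits:
        totals[url] += seconds
        counts[url] += 1
    return {url: totals[url] / counts[url] for url in totals}
```

In practice such values would be aggregated over many impressions of the same query before being used as features.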

Some of the above features (e.g., click-through features and dwell time) are regarded as biased and only probabilistically related to the true relevance. Such features can be represented as a mixture of two components: the prior “background” distribution of the feature's value aggregated across all queries, and the component of the feature influenced by the relevance of the documents. Therefore, one can subtract the background distribution from the observed feature value for the document at a given position. This treatment deals well with the position bias in click-through data.
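As a rough illustration of the background subtraction, the sketch below estimates the background click frequency at each rank position over all queries and subtracts it from the frequency observed for a single query. The log format, a list of (query, position, clicked) tuples, and both function names are assumptions for this example rather than the exact procedure of [1].

```python
from collections import defaultdict

def background_click_rate(impressions):
    """Click frequency at each rank position over the given impressions.

    impressions: list of (query, position, clicked) tuples, clicked in {0, 1}.
    """
    clicks = defaultdict(int)
    shown = defaultdict(int)
    for _query, position, clicked in impressions:
        clicks[position] += clicked
        shown[position] += 1
    return {p: clicks[p] / shown[p] for p in shown}

def click_deviation(impressions, query):
    """Observed click frequency for `query` minus the background, per position."""
    background = background_click_rate(impressions)
    observed = background_click_rate(
        [imp for imp in impressions if imp[0] == query]
    )
    return {p: observed[p] - background[p] for p in observed}
```

A positive deviation at a position suggests the result there attracts more clicks than that position alone would explain; a value near zero suggests the clicks are mostly due to position.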

Given the above features (with the background distribution subtracted), a general implicit feedback interpretation strategy is learned automatically instead of relying on heuristics or insights. The general approach is to train a classifier to induce weights for the user behavior features, and consequently derive a predictive model of user preferences. The training is done by comparing a wide range of implicit behavior features with explicit human judgments for a set of queries. RankNet [4] is used as the learning algorithm.
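To make the idea of inducing feature weights from preference judgments concrete, here is a minimal sketch of a pairwise logistic loss of the kind RankNet optimizes. It assumes a linear scoring function trained by plain gradient descent, which is a simplification: RankNet itself uses a neural network scorer. The function name, arguments, and hyperparameters are hypothetical.

```python
import numpy as np

def train_pairwise_linear(features, preferred_pairs, lr=0.1, epochs=200):
    """Fit weights w so that score(x) = w @ x ranks preferred items higher.

    features: array of shape (n_items, n_features) of behavior features.
    preferred_pairs: list of (i, j) meaning item i is judged better than item j.
    """
    features = np.asarray(features, dtype=float)
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for i, j in preferred_pairs:
            diff = features[i] - features[j]
            # P(i preferred over j) under a logistic model of the score gap.
            p = 1.0 / (1.0 + np.exp(-(w @ diff)))
            # Gradient of the pairwise cross-entropy loss -log(p).
            grad += (p - 1.0) * diff
        w -= lr * grad / len(preferred_pairs)
    return w
```

The learned weights can then score unlabeled documents from their behavior features alone, and the resulting ordering serves as a predicted preference.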

According to the experiments conducted in [1], by using a machine learning based approach to combine multiple pieces of evidence, one can mine more reliable ground-truth labels for documents than by relying purely on click-through information.

13.2.2.2 Smoothing Click-Through Data

To tackle the sparseness of click-through data, a query clustering technique is used in [15] to smooth the data.

Suppose we have obtained click-through information for query q and document d. The basic idea is to propagate the click-through information to other similar queries. In order to determine the similar queries, the co-click principle (queries for which users have clicked on the same documents can be considered to be similar) is employed. Specifically, a random walk model is used to derive the query similarity in a dynamic manner.
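The co-click principle can be sketched with a two-step random walk (query to document, then document back to query) over click counts. The text does not give the details of the random walk model in [15], so the fixed two-step walk, the count-proportional transition probabilities, and the function name below are all assumptions made for illustration.

```python
from collections import defaultdict

def query_similarity(click_counts):
    """Two-step random walk (query -> document -> query) on a click graph.

    click_counts: dict mapping (query, doc) to a click count.
    Returns sim[q][q2] = probability of reaching q2 from q in two steps.
    """
    by_query = defaultdict(dict)
    by_doc = defaultdict(dict)
    for (query, doc), count in click_counts.items():
        by_query[query][doc] = count
        by_doc[doc][query] = count

    sim = defaultdict(lambda: defaultdict(float))
    for q, docs in by_query.items():
        q_total = sum(docs.values())
        for d, c_qd in docs.items():
            d_total = sum(by_doc[d].values())
            for q2, c_dq2 in by_doc[d].items():
                # P(q -> d) * P(d -> q2), each proportional to click counts.
                sim[q][q2] += (c_qd / q_total) * (c_dq2 / d_total)
    return sim
```

Queries that never share a clicked document receive zero similarity under this walk, while queries whose clicks concentrate on the same documents receive high similarity.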

For this purpose, a click graph, i.e., a bipartite-graph representation of the click-through data, is constructed. $\{q_i\}_{i=1}^{n}$ represents a set of query nodes and $\{d_j\}_{j=1}^{m}$ represents a set of document nodes. Then the bipartite graph can be represented by an $m \times n$ matrix $W$, in which $W_{i,j}$ represents the click information associated
