Data Preprocessing for Learning to Rank - Learning to Rank for Information Retrieval

relevance since it measures the probability of a click based on the URL. The second one is the probability that the user is satisfied given that he or she has clicked on the link; so it can be understood as a 'ratio' between actual and perceived relevance, and the true relevance of the document can be computed as the product a_u s_u.

With the DBN model defined as above, the Expectation-Maximization (EM) algorithm can be used to find the maximum likelihood estimate of the variables a_u and s_u. The parameter γ is treated as a configurable parameter for the model and is not considered in the parameter estimation process.
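To make the roles of a_u, s_u, and γ concrete, the following minimal sketch computes the marginal click probability at each rank once the parameters are known. It is not the EM estimation procedure itself. It assumes the usual DBN cascade: the top result is always examined, an examined result is clicked with probability a_u, a clicked result satisfies the user with probability s_u, and an unsatisfied user moves on to the next result with probability γ. The function names and example values are hypothetical.

```python
def dbn_click_probabilities(attractiveness, satisfaction, gamma):
    """Marginal click probability per rank under the assumed DBN cascade.

    attractiveness[i] is a_u and satisfaction[i] is s_u for the URL at rank i;
    gamma is the fixed (not estimated) persistence parameter.
    """
    examine = 1.0  # assumption: the first result is always examined
    clicks = []
    for a_u, s_u in zip(attractiveness, satisfaction):
        # A click requires examination and perceived relevance.
        clicks.append(examine * a_u)
        # The user continues only if not satisfied here (prob. 1 - a_u * s_u)
        # and then decides to persist (prob. gamma).
        examine *= gamma * (1.0 - a_u * s_u)
    return clicks


def dbn_relevance(a_u, s_u):
    """Relevance of a URL as the product of its two DBN parameters."""
    return a_u * s_u


if __name__ == "__main__":
    a = [0.8, 0.5, 0.3]  # illustrative values only
    s = [0.5, 0.9, 0.4]
    print(dbn_click_probabilities(a, s, gamma=0.9))
    print([dbn_relevance(x, y) for x, y in zip(a, s)])
```

In a full pipeline the values of a_u and s_u would come from EM over the click logs, with γ chosen beforehand, for example by validation.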

13.2.2 Click Data Enhancement

In the previous subsection, we introduced various click models for ground truth mining. These models can be effective; however, they also have certain limitations. First, although the click information is very helpful, it is not the only information source that can be used to mine ground-truth labels. For example, the content information about the query and the clicked documents can also be very helpful. More reliable labels are expected to be mined if one can use more comprehensive information for the task. Second, it is almost unavoidable that the labels mined from click-through logs are highly sparse. There are three likely reasons: (i) the click-through logs from a search engine company may not cover all users' behaviors due to its limited market share; (ii) since the search results provided by existing search engines are far from perfect, it is highly possible that no document is relevant with respect to some queries and therefore there will be no clicks for such queries; (iii) users may issue new queries constantly, and therefore historical click-through logs cannot cover newly issued queries.

To tackle the aforementioned problems, in [1], Agichtein et al. consider additional information and learn a user interaction model from training data, and in [15], smoothing techniques are used to expand the sparse click data. We will introduce these two pieces of work in detail in this subsection.

13.2.2.1 Learning a User Interaction Model

In [1], a rich set of features is used to characterize whether a user will be satisfied with a web search result. Once the user has submitted a query, he/she will perform many different actions (e.g., reading snippets, clicking results, navigating, and refining the query). To capture and summarize these actions, three groups of features are used: query-text, click-through, and browsing.

• Query-text features: Users decide which results to examine in more detail by looking at the result title, URL, and snippet. In many cases, looking at the original document is not even necessary. To model this aspect of user experience, features that characterize the nature of the query and its relation to the snippet text are extracted, including overlap between the words in the title and in the query, the fraction of words shared by the query and the snippet, etc.
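As an illustration of the two query-text features named above, the sketch below computes a title-query word overlap and a query-snippet shared-word fraction. The exact definitions used in [1] are not given here, so the choices in this sketch are assumptions: lowercase alphanumeric tokenization, overlap measured as a count of distinct shared words, and the shared fraction measured as the Jaccard ratio. The function and key names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_text_features(query, title, snippet):
    """Two illustrative query-text features for one search result.

    Assumed definitions (not taken from [1]):
      title_overlap           number of distinct words shared by query and title
      snippet_shared_fraction |query AND snippet| / |query OR snippet| (Jaccard)
    """
    q, t, s = tokenize(query), tokenize(title), tokenize(snippet)
    union = q | s
    return {
        "title_overlap": len(q & t),
        "snippet_shared_fraction": len(q & s) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    print(query_text_features(
        "learning to rank",
        "Learning to Rank for Information Retrieval",
        "rank documents by relevance",
    ))
```

Features of this kind would be computed for each query-result pair and combined with the click-through and browsing features before a model is trained.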
