The Semantics of Search - Social Semantics: The Search for Meaning on the Web


6.3.2.2 Language Modeling Baseline

When no relevance judgments are available, the language modeling approach ranks documents D by the probability that the query Q could be observed during repeated random sampling from the distribution $u_D(\cdot)$. The typical sampling process assumes that words are drawn independently, with replacement, leading to the following retrieval score being assigned to document D:

$$P(Q \mid D) = \prod_{q \in Q} u_D(q) \qquad (6.7)$$

The ranking function in (6.7) is called query-likelihood ranking and is used as a baseline for our language-modeling experiments.
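As an illustration of query-likelihood ranking, the following minimal Python sketch scores two toy documents against a query. It assumes an unsmoothed maximum-likelihood estimate of $u_D$ (word count divided by document length), which may differ from the estimate described earlier in the book; the function names `unigram_model` and `query_likelihood` and the toy documents are hypothetical.

```python
from collections import Counter
from math import prod


def unigram_model(tokens):
    """Assumed estimate of u_D: relative frequency of each word in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def query_likelihood(query_tokens, doc_model):
    """Score from (6.7): product of u_D(q) over all query words q.

    Words unseen in the document get probability 0 here, because this
    sketch applies no smoothing.
    """
    return prod(doc_model.get(q, 0.0) for q in query_tokens)


# Hypothetical toy collection and query.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
query = "the cat".split()

scores = {name: query_likelihood(query, unigram_model(tokens))
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Without smoothing, a single query word missing from a document drives its score to zero, which is one reason practical systems usually smooth $u_D$ with collection statistics.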

6.3.2.3 Language Models and Relevance Feedback

The classical language-modeling approach to IR does not provide a natural mechanism to perform relevance feedback. However, a popular extension of the approach involves estimating a relevance-based model $u_R$ in addition to the document-based model $u_D$, and comparing the resulting language models using information-theoretic measures. Estimation of $u_D$ has been described above, so this section will describe two ways of estimating the relevance model $u_R$, and a way of measuring distance between $u_R$ and $u_D$ for the purposes of document ranking.

Let $R = r_1 \ldots r_k$ be the set of k relevant documents, identified during the feedback process. One way of constructing a language model of R is to average the document models of each document in the set:

$$u_{R,\mathrm{avg}}(w) = \frac{1}{k} \sum_{i=1}^{k} u_{r_i}(w) = \frac{1}{k} \sum_{i=1}^{k} \frac{n(w, r_i)}{|r_i|} \qquad (6.8)$$

Here $n(w, r_i)$ is the number of times the word w occurs in the i-th relevant document, and $|r_i|$ is the length of that document. Another way to estimate the same distribution would be to concatenate all relevant documents into one long string of text, and count word frequencies in that string:

$$u_{R,\mathrm{con}}(w) = \frac{\sum_{i=1}^{k} n(w, r_i)}{\sum_{i=1}^{k} |r_i|} \qquad (6.9)$$

Here the numerator $\sum_{i=1}^{k} n(w, r_i)$ represents the total number of times the word w occurs in the concatenated string, and the denominator is the length of the concatenated string. The difference between (6.8) and (6.9) is that the former treats every document equally, regardless of its length, whereas the latter favors longer documents (they are not individually penalized by dividing their contributing frequencies $n(w, r_i)$ by their length $|r_i|$).
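The contrast between the averaged estimate (6.8) and the concatenated estimate (6.9) can be seen on a small example. The sketch below is illustrative only: the function names `relevance_model_avg` and `relevance_model_con` and the two toy documents are hypothetical, and every relevant document is assumed to be non-empty.

```python
from collections import Counter


def relevance_model_avg(rel_docs):
    """Estimate (6.8): average of the per-document models n(w, r_i) / |r_i|."""
    k = len(rel_docs)
    model = Counter()
    for doc in rel_docs:
        for word, n in Counter(doc).items():
            # Each document contributes 1/k of its own relative frequency.
            model[word] += n / (len(doc) * k)
    return dict(model)


def relevance_model_con(rel_docs):
    """Estimate (6.9): word frequencies in the concatenation of all documents."""
    counts = Counter()
    for doc in rel_docs:
        counts.update(doc)
    total = sum(len(doc) for doc in rel_docs)
    return {word: n / total for word, n in counts.items()}


# Hypothetical feedback set: one short and one long relevant document.
short_doc = ["jaguar", "car"]
long_doc = ["jaguar"] + ["cat"] * 7
rel_docs = [short_doc, long_doc]

u_avg = relevance_model_avg(rel_docs)
u_con = relevance_model_con(rel_docs)

for word in sorted(u_avg):
    print(f"{word}: avg={u_avg[word]:.4f} con={u_con[word]:.4f}")
```

In this toy set the word "car", which appears only in the short document, receives more probability mass under the averaged estimate than under the concatenated one, while "cat", which appears only in the long document, receives more under concatenation.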
