6.3.2.2 Language Modeling Baseline
When no relevance judgments are available, the language modeling approach ranks
documents $D$ by the probability that the query $Q$ could be observed during
repeated random sampling from the distribution $u_D(\cdot)$. The typical sampling
process assumes that words are drawn independently, with replacement, leading to
the following retrieval score being assigned to document $D$:

$$P(Q \mid D) = \prod_{q \in Q} u_D(q) \tag{6.7}$$
The ranking function in (6.7) is called query-likelihood ranking and is used as a
baseline for our language-modeling experiments.
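
As a minimal sketch of query-likelihood ranking, the Python below scores two toy
documents against a query using the product in (6.7). The helper names
(`mle_model`, `query_likelihood`), the toy documents, and the choice of an
unsmoothed maximum-likelihood estimate for $u_D$ are assumptions made for
illustration; the estimator actually used for $u_D$ is described elsewhere in the
text and may differ (for example, it may be smoothed).

```python
from collections import Counter


def mle_model(tokens):
    """Assumed estimator: unsmoothed maximum likelihood, u_D(w) = n(w, D) / |D|."""
    counts = Counter(tokens)
    total = len(tokens)  # assumes a non-empty document
    return {w: c / total for w, c in counts.items()}


def query_likelihood(query_tokens, doc_model):
    """P(Q|D) as the product of u_D(q) over the query tokens q, as in (6.7)."""
    score = 1.0
    for q in query_tokens:
        # A word the model has never seen contributes probability 0 here,
        # which drives the whole product to 0 for this unsmoothed sketch.
        score *= doc_model.get(q, 0.0)
    return score


# Hypothetical toy collection and query.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
query = "the cat".split()

scores = {name: query_likelihood(query, mle_model(toks)) for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these toy inputs the shorter document `d2` scores higher, because the same
word counts make up a larger share of its text. For long queries an
implementation would usually sum log-probabilities instead of multiplying, to
avoid numerical underflow.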
6.3.2.3 Language Models and Relevance Feedback
The classical language-modeling approach to IR does not provide a natural
mechanism to perform relevance feedback. However, a popular extension of the
approach involves estimating a relevance-based model $u_R$ in addition to the
document-based model $u_D$, and comparing the resulting language models using
information-theoretic measures. Estimation of $u_D$ has been described above, so
this section will describe two ways of estimating the relevance model $u_R$, and
a way of measuring distance
between $u_R$ and $u_D$ for the purposes of document ranking.
Let $R = \{r_1, \ldots, r_k\}$ be the set of $k$ relevant documents, identified
during the feedback process. One way of constructing a language model of $R$ is
to average the document models of each document in the set:

$$u_{R,\mathrm{avg}}(w) = \frac{1}{k} \sum_{i=1}^{k} u_{r_i}(w) = \frac{1}{k} \sum_{i=1}^{k} \frac{n(w, r_i)}{|r_i|} \tag{6.8}$$

Here $n(w, r_i)$ is the number of times the word $w$ occurs in the $i$th relevant
document, and $|r_i|$ is the length of that document. Another way to estimate the
same distribution would be to concatenate all relevant documents into one long
string of text, and count word frequencies in that string:

$$u_{R,\mathrm{con}}(w) = \frac{\sum_{i=1}^{k} n(w, r_i)}{\sum_{i=1}^{k} |r_i|} \tag{6.9}$$

Here the numerator $\sum_{i=1}^{k} n(w, r_i)$ represents the total number of times
the word $w$ occurs in the concatenated string, and the denominator is the length
of the concatenated string. The difference between (6.8) and (6.9) is that the
former treats every document equally, regardless of its length, whereas the
latter favors longer documents (they are not individually penalized by dividing
their contributing frequencies $n(w, r_i)$ by their length $|r_i|$).
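
The contrast between the two estimators can be seen on a small example. The
sketch below implements (6.8) and (6.9) directly from word counts; the function
names `relevance_model_avg` and `relevance_model_con` and the two toy relevant
documents are hypothetical, chosen only to make the length effect visible.

```python
from collections import Counter


def relevance_model_avg(rel_docs):
    """Estimate (6.8): average the per-document relative frequencies."""
    k = len(rel_docs)
    model = Counter()
    for doc in rel_docs:  # each doc is a non-empty list of tokens
        for w, c in Counter(doc).items():
            model[w] += (c / len(doc)) / k
    return dict(model)


def relevance_model_con(rel_docs):
    """Estimate (6.9): pool counts over all documents, then normalize once."""
    counts = Counter()
    for doc in rel_docs:
        counts.update(doc)
    total = sum(len(doc) for doc in rel_docs)  # length of the concatenation
    return {w: c / total for w, c in counts.items()}


# Hypothetical feedback set: one short document and one long document.
rel_docs = [
    ["jazz", "jazz"],
    ["rock"] * 6 + ["jazz"] * 2,
]

u_avg = relevance_model_avg(rel_docs)
u_con = relevance_model_con(rel_docs)
```

Under these assumptions, averaging gives "jazz" a probability of 0.625 because
the short all-jazz document counts as much as the long one, whereas
concatenation gives it 0.4 because the eight-word document contributes most of
the pooled counts.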
 