would generate the query is given by the ranking function of the document. A more sophisticated approach to language models considers that the query was a sample from an underlying relevance model of unknown relevant documents, but that the model could be estimated by computing the co-occurrence of the query terms with every term in the vocabulary. In this way, the query itself was just considered a limited sample that is automatically expanded, before the search has even begun, by re-sampling the underlying relevance model.
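The co-occurrence-based expansion described above can be sketched roughly as follows. This is a minimal illustration, not the estimation procedure of any particular relevance-model implementation; the function names and the crude scoring (raw within-document co-occurrence counts with the query terms) are our own simplifying assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each unordered pair of terms co-occurs in a document."""
    pair_counts = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def expand_query(query, documents, k=3):
    """Rank vocabulary terms by total co-occurrence with the query terms
    and append the top k as expansion terms -- a crude stand-in for
    re-sampling an underlying relevance model."""
    pair_counts = cooccurrence_counts(documents)
    vocab = {term for doc in documents for term in doc}
    scores = Counter()
    for term in vocab - set(query):
        for q in set(query):
            scores[term] += pair_counts.get(tuple(sorted((q, term))), 0)
    return list(query) + [term for term, _ in scores.most_common(k)]
```

For example, given documents `[["cat", "pet"], ["cat", "pet", "dog"], ["car", "road"]]`, expanding the query `["cat"]` by one term yields `["cat", "pet"]`, since "pet" co-occurs with "cat" most often.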
In detail, we will now inspect the various weighting and ranking functions of the
two frameworks. A number of different options for the parameters of each weighting
function and the appropriate ranking function will be considered.
6.3.1 Vector Space Models

6.3.1.1 Representation
Each vector-space model has as a parameter the factor m, the maximum window size, which is the number of words, ranked in descending order of frequency, that are used in the document models. In other words, the size of the vectors in the vector-space model is m. Words with a zero frequency are excluded from the document model.
6.3.1.2 Weighting Function: BM25
The current state-of-the-art weighting function for vector-space models is BM25, one of a family of weighting functions explored by Robertson (1994) and a descendant of the tf.idf weighting scheme pioneered by Robertson and Sparck Jones (1976). In particular, we will use a version of BM25 with the slight performance-enhancing modifications used in the InQuery system (Allan et al. 2000). This weighting scheme has been carefully optimized and routinely shows excellent performance in TREC competitions (Craswell et al. 2005). The InQuery BM25 function assigns the following weight to a word q occurring in a document D:
D_q = [n(q, D) / (n(q, D) + 0.5 + 1.5 · (dl / avg(dl)))] · [log(0.5 + N / df(q)) / log(1.0 + log N)]    (6.1)

The BM25 weighting function is summed for every term q ∈ Q. For every q, BM25 calculates the number of occurrences of a term q from the query in the document D, n(q, D), and then weighs this by the length dl of document D in comparison to the average document length avg(dl). This is in essence the equivalent of term frequency in tf.idf. The BM25 weighting function then takes into account the total number of documents N and the document frequency df(q) of the query term. This second component is the idf component of classical tf.idf.