would generate the query is given by the ranking function of the document. A more
sophisticated approach to language models considers the query to be a sample from
an underlying relevance model of unknown relevant documents, a model that can
nevertheless be estimated by computing the co-occurrence of the query terms with
every term in the vocabulary. In this way, the query itself is treated as a limited
sample that is automatically expanded, before the search has even begun, by
re-sampling the underlying relevance model.
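To make this idea concrete, the sketch below shows one simple way such co-occurrence-based expansion could be implemented. It is only a rough illustration: the function name, the weighting of documents by their query-term overlap, and the cut-off of ten expansion terms are assumptions made here, not the exact relevance-model estimation described above.

```python
from collections import Counter

def expand_query(query_terms, documents, num_expansion_terms=10):
    """Rough sketch of co-occurrence-based query expansion: terms that
    frequently appear in documents containing the query terms are added
    to the query before retrieval begins."""
    cooccurrence = Counter()
    for doc in documents:                      # doc: list of tokens
        doc_counts = Counter(doc)
        # weight each document by how many query-term occurrences it contains
        overlap = sum(doc_counts[q] for q in query_terms)
        if overlap == 0:
            continue
        for term, count in doc_counts.items():
            cooccurrence[term] += overlap * count
    expansion = [t for t, _ in cooccurrence.most_common(num_expansion_terms)]
    return list(query_terms) + expansion
```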
In detail, we will now inspect the various weighting and ranking functions of the
two frameworks. A number of different options for the parameters of each weighting
function and the appropriate ranking function will be considered.
6.3.1 Vector Space Models

6.3.1.1 Representation
Each vector-space model has as a parameter the factor m, the maximum window size,
which is the number of words, ranked in descending order of frequency, that are used
in the document models. In other words, the size of the vectors in the vector-space
model is m. Words with a zero frequency are excluded from the document model.
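As an illustration, the following sketch builds such a document model by keeping only the m most frequent words of a tokenised document; the function name and the toy example are assumptions for illustration, not part of the framework itself.

```python
from collections import Counter

def document_model(tokens, m):
    """Keep only the m most frequent terms of the document (the
    'maximum window size'); zero-frequency terms never appear."""
    counts = Counter(tokens)
    return dict(counts.most_common(m))

# Example: a toy document reduced to its m = 3 most frequent words
model = document_model("to be or not to be that is the question".split(), 3)
# e.g. {'to': 2, 'be': 2, 'or': 1}
```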
6.3.1.2 Weighting Function: BM25
The current state-of-the-art weighting function for vector-space models is BM25, one
of a family of weighting functions explored by Robertson (1994) and a descendant
of the tf.idf weighting scheme pioneered by Robertson and Sparck Jones (1976). In
particular, we will use a version of BM25 with the slight performance-enhancing
modifications used in the InQuery system (Allan et al. 2000). This weighting
scheme has been carefully optimized and routinely shows excellent performance
in TREC competitions (Craswell et al. 2005). The InQuery BM25 function assigns
the following weight to a word q occurring in a document D:
D_q = \frac{n(q,D)}{n(q,D) + 0.5 + 1.5 \cdot \frac{dl}{avg(dl)}} \cdot \frac{\log\left(0.5 + N/df(q)\right)}{\log\left(1.0 + \log N\right)}    (6.1)
The BM25 weighting function is summed for every term q in the query Q. For every
q, BM25 calculates the number of occurrences of a term q from the query in
the document D, n(q,D), and then weighs this by the length dl of document D in
comparison to the average document length avg(dl). This is in essence the
equivalent of term frequency in tf.idf. The BM25 weighting function then takes
into account the total number of documents N and the document frequency df(q)
of the query term. This second component is the idf component of classical tf.idf.
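A direct transcription of Eq. (6.1) might look like the following sketch; the function names and the handling of terms with an unknown document frequency are assumptions made for illustration, not part of the InQuery system itself.

```python
import math

def inquery_bm25_weight(n_qD, dl, avg_dl, N, df_q):
    """Weight of query term q in document D following Eq. (6.1):
    a length-normalised term-frequency part times an idf part.

    n_qD   -- n(q, D), occurrences of q in D
    dl     -- length of document D
    avg_dl -- average document length in the collection
    N      -- total number of documents
    df_q   -- df(q), number of documents containing q
    """
    tf_part = n_qD / (n_qD + 0.5 + 1.5 * dl / avg_dl)
    idf_part = math.log(0.5 + N / df_q) / math.log(1.0 + math.log(N))
    return tf_part * idf_part

def bm25_score(query_terms, doc_counts, dl, avg_dl, N, df):
    """Document score: the weight summed over every term q in the query Q.
    Unseen terms contribute zero via n(q, D) = 0."""
    return sum(
        inquery_bm25_weight(doc_counts.get(q, 0), dl, avg_dl, N, df.get(q, 1))
        for q in query_terms
    )
```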
 