would generate the query is given by the ranking function of the document. A more sophisticated approach to language models considers that the query was a sample from an underlying relevance model of unknown relevant documents, but that the model could be estimated by computing the co-occurrence of the query terms with every term in the vocabulary. In this way, the query itself was just considered a limited sample that is automatically expanded, before the search has even begun, by re-sampling the underlying relevance model.
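The co-occurrence-based expansion described above can be sketched roughly as follows. This is a minimal illustration, not the estimation procedure of any particular relevance-model implementation; the function names and the crude scoring (raw within-document co-occurrence counts with the query terms) are our own simplifying assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each unordered pair of terms co-occurs in a document."""
    pair_counts = Counter()
    for doc in documents:
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def expand_query(query, documents, k=3):
    """Rank vocabulary terms by total co-occurrence with the query terms
    and append the top k as expansion terms -- a crude stand-in for
    re-sampling an underlying relevance model."""
    pair_counts = cooccurrence_counts(documents)
    vocab = {term for doc in documents for term in doc}
    scores = Counter()
    for term in vocab - set(query):
        for q in set(query):
            scores[term] += pair_counts.get(tuple(sorted((q, term))), 0)
    return list(query) + [term for term, _ in scores.most_common(k)]
```

For example, given documents `[["cat", "pet"], ["cat", "pet", "dog"], ["car", "road"]]`, expanding the query `["cat"]` by one term yields `["cat", "pet"]`, since "pet" co-occurs with "cat" most often.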
In detail, we will now inspect the various weighting and ranking functions of the
two frameworks. A number of different options for the parameters of each weighting
function and the appropriate ranking function will be considered.
6.3.1 Vector Space Models

6.3.1.1 Representation
Each vector-space model has as a parameter the factor m, the maximum window size, which is the number of words, ranked in descending order of frequency, that are used in the document models. In other words, the size of the vectors in the vector-space model is m. Words with a zero frequency are excluded from the document model.
6.3.1.2 Weighting Function: BM25
The current state-of-the-art weighting function for vector-space models is BM25, one of a family of weighting functions explored by Robertson (1994) and a descendant of the tf.idf weighting scheme pioneered by Robertson and Sparck Jones (1976). In particular, we will use a version of BM25 with the slight performance-enhancing modifications used in the InQuery system (Allan et al. 2000). This weighting scheme has been carefully optimized and routinely shows excellent performance in TREC competitions (Craswell et al. 2005). The InQuery BM25 function assigns the following weight to a word q occurring in a document D:
D_q = [n(q, D) / (n(q, D) + 0.5 + 1.5 · (dl / avg(dl)))] · [log(0.5 + N / df(q)) / log(1.0 + log N)]    (6.1)

The BM25 weighting function is summed for every term q ∈ Q. For every q, BM25 calculates the number of occurrences of a term q from the query in the document D, n(q, D), and then weighs this by the length dl of document D in comparison to the average document length avg(dl). This is in essence the equivalent of term frequency in tf.idf. The BM25 weighting function then takes into account the total number of documents N and the document frequency df(q) of the query term. This second component is the idf component of classical tf.idf.