To further model the degree of relevance, the Vector Space Model (VSM) was proposed [ 2 ]. Both documents and queries are represented as vectors in a Euclidean space, in which the inner product of two vectors can be used to measure their similarity. To obtain an effective vector representation of the query and the documents,
TF-IDF weighting has been widely used. 7 The TF of a term t in a vector is defined as the normalized number of its occurrences in the document, and its IDF is defined as follows:

\[ \mathrm{IDF}(t) = \log \frac{N}{n(t)} \tag{1.1} \]
where N is the total number of documents in the collection, and n(t) is the number
of documents containing term t .
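The TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name, the raw-count TF normalization, and the toy collection are all choices made here for clarity, matching the simple definitions given in the text.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, collection, vocab):
    """Build a TF-IDF vector for one document.

    TF is the occurrence count normalized by document length;
    IDF(t) = log(N / n(t)) as in Eq. (1.1).
    """
    N = len(collection)                      # total number of documents
    counts = Counter(doc_terms)
    vec = {}
    for t in vocab:
        tf = counts[t] / len(doc_terms)      # normalized term frequency
        n_t = sum(1 for d in collection if t in d)  # docs containing t
        idf = math.log(N / n_t) if n_t else 0.0
        vec[t] = tf * idf
    return vec
```

Note that a term occurring in every document gets IDF = log(N/N) = 0, so it contributes nothing to the vector; this is exactly the discriminative effect IDF is designed to provide.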
While VSM implicitly assumes independence between terms, Latent Semantic Indexing (LSI) [ 23 ] tries to avoid this assumption. In particular, Singular
Value Decomposition (SVD) is used to linearly transform the original feature space
to a “latent semantic space”. Similarity in this new space is then used to define the
relevance between the query and the documents.
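The SVD-based transformation behind LSI can be sketched as follows. This is a toy illustration under assumptions made here: the term-document matrix uses raw counts rather than TF-IDF weights, and the fold-in formula and helper names are not from the text.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents).
# In practice the entries would be TF-IDF weights.
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# Truncated SVD: keep only the k largest singular values/vectors,
# defining a k-dimensional "latent semantic space".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def to_latent(v):
    """Fold a term-space vector (document or query) into the latent
    space via v_hat = S_k^{-1} U_k^T v."""
    return (Uk.T @ v) / sk

def cosine(a, b):
    """Cosine similarity, used to define relevance in the latent space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A query is treated as a pseudo-document: it is folded into the same latent space with `to_latent`, and documents are ranked by their cosine similarity to it.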
Compared with the above models, those based on the probabilistic ranking principle [ 50 ] have garnered more attention and achieved greater success in the past decades. Famous ranking models such as the BM25 model 8 [ 65 ] and the language model for information retrieval (LMIR) can both be categorized as probabilistic ranking models.
The basic idea of BM25 is to rank documents by the log-odds of their relevance. Actually, BM25 is not a single model but a whole family of ranking models, with slightly different components and parameters. One popular instantiation of the model is as follows.
Given a query q containing terms t_1, ..., t_M, the BM25 score of a document d is computed as

\[ \mathrm{BM25}(d, q) = \sum_{i=1}^{M} \mathrm{IDF}(t_i) \cdot \frac{\mathrm{TF}(t_i, d) \cdot (k_1 + 1)}{\mathrm{TF}(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{LEN}(d)}{avdl}\right)} \]
where TF(t, d) is the term frequency of t in document d, LEN(d) is the length (number of words) of document d, and avdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, and IDF(t) is the IDF weight of term t, computed, for example, by using ( 1.1 ).
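The BM25 instantiation above translates directly into code. The sketch below is an assumption-laden illustration: the default values k1 = 1.2 and b = 0.75 are common choices in the literature, not prescribed by the text, and the function name and toy data are invented here.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, collection, k1=1.2, b=0.75):
    """BM25 score of one document for a query, following the formula
    in the text; IDF is computed as in Eq. (1.1).

    k1 and b are the free parameters; 1.2 and 0.75 are common defaults.
    """
    N = len(collection)
    avdl = sum(len(d) for d in collection) / N   # average doc length
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in collection if t in d)
        if n_t == 0:
            continue                             # unseen term: skip
        idf = math.log(N / n_t)
        num = tf[t] * (k1 + 1)
        den = tf[t] + k1 * (1 - b + b * len(doc_terms) / avdl)
        score += idf * num / den
    return score
```

The b parameter controls how strongly scores are normalized by document length (b = 0 disables length normalization entirely), while k1 controls how quickly the term-frequency contribution saturates.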
LMIR [ 57 ] is an application of statistical language modeling to information retrieval. A statistical language model assigns a probability to a sequence of terms. When used in information retrieval, a language model is associated with a document.
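One common way to turn the per-document language model into a ranking score is the query-likelihood approach: rank documents by log P(q | d) under each document's unigram model, smoothed against the whole collection. The sketch below assumes Jelinek-Mercer smoothing with a mixing weight lam; the specific smoothing method, the parameter value, and all names are illustrative choices, not details fixed by the text.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection, lam=0.5):
    """Score a document by log P(q | d) under its unigram language
    model, with Jelinek-Mercer smoothing against the collection model.

    lam mixes the document model (weight lam) with the collection
    model (weight 1 - lam); 0.5 is an arbitrary illustrative value.
    """
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(t for d in collection for t in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / len(doc_terms)   # ML estimate from d
        p_coll = coll_counts[t] / coll_len       # collection background
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")                 # term unseen anywhere
        score += math.log(p)
    return score
```

Smoothing is essential here: without it, a single query term absent from a document would drive P(q | d) to zero regardless of how well the other terms match.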
7 Note that there are many different definitions of TF and IDF in the literature. Some are purely based on term frequency, while others include smoothing or normalization [ 70 ]. Here we just give some simple examples to illustrate the main idea.
8 The name of the actual model is BM25. However, it is usually referred to as "Okapi BM25", since the Okapi system was the first system to implement this model.