
To further model the relevance degree, the Vector Space Model (VSM) was proposed [2]. Both documents and queries are represented as vectors in a Euclidean space, in which the inner product of two vectors can be used to measure their similarity. To get an effective vector representation of the query and the documents, TF-IDF weighting has been widely used.^7 The TF of a term t in a vector is defined as the normalized number of its occurrences in the document, and its IDF is defined as follows:

    IDF(t) = log(N / n(t)),    (1.1)

where N is the total number of documents in the collection, and n(t) is the number of documents containing term t.
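The VSM scoring scheme described above can be sketched as follows. This is a minimal illustration of Eq. (1.1) and inner-product similarity over TF-IDF vectors; the toy documents, the tokenized-list representation, and the length-normalized TF are illustrative assumptions, not a prescription from the text.

```python
import math
from collections import Counter

def idf(term, docs):
    """IDF(t) = log(N / n(t)) as in Eq. (1.1); n(t) = number of docs containing t."""
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t) if n_t else 0.0

def tfidf_vector(doc, vocab, docs):
    """TF is the count of t in doc, normalized by document length."""
    counts = Counter(doc)
    return [(counts[t] / len(doc)) * idf(t, docs) for t in vocab]

def inner_product(u, v):
    """Inner product of two vectors, used as the similarity measure in VSM."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy collection: each document is a list of tokens.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
vocab = sorted({t for d in docs for t in d})
query = ["cat"]

vecs = [tfidf_vector(d, vocab, docs) for d in docs]
qvec = tfidf_vector(query, vocab, docs)
scores = [inner_product(qvec, v) for v in vecs]
```

A document that does not contain any query term receives a score of zero, while documents sharing rare terms with the query score higher.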

While VSM implies the assumption of independence between terms, Latent Semantic Indexing (LSI) [23] tries to avoid this assumption. In particular, Singular Value Decomposition (SVD) is used to linearly transform the original feature space to a "latent semantic space". Similarity in this new space is then used to define the relevance between the query and the documents.
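The LSI idea can be sketched with a truncated SVD of the term-document matrix. The matrix values, the choice of k = 2 latent dimensions, and the use of cosine similarity in the latent space are illustrative assumptions; the text only specifies that SVD defines the transform and that similarity is measured in the new space.

```python
import numpy as np

# Hypothetical term-document matrix X (rows = terms, columns = documents);
# in practice the entries would be TF-IDF weights.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2  # number of latent dimensions to keep (an assumption for this sketch)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def to_latent(term_vector):
    """Fold a term-space vector (document or query) into the latent space."""
    return term_vector @ Uk / sk

def cosine(u, v):
    """Cosine similarity, used here to compare vectors in the latent space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

q = np.array([1.0, 0.0, 0.0, 0.0])  # query containing only the first term
q_hat = to_latent(q)
doc_hats = [to_latent(X[:, j]) for j in range(X.shape[1])]
scores = [cosine(q_hat, d) for d in doc_hats]
```

Because the latent space mixes co-occurring terms, a document can receive a nonzero score even when it shares no term with the query, which is precisely how LSI relaxes the term-independence assumption.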

As compared with the above, models based on the probabilistic ranking principle [50] have garnered more attention and achieved more success in past decades. Famous ranking models like the BM25 model^8 [65] and the language model for information retrieval (LMIR) can both be categorized as probabilistic ranking models.

The basic idea of BM25 is to rank documents by the log-odds of their relevance. Actually, BM25 is not a single model, but defines a whole family of ranking models with slightly different components and parameters. One of the popular instantiations of the model is as follows.

Given a query q, containing terms t_1, ..., t_M, the BM25 score of a document d is computed as

    BM25(d, q) = Σ_{i=1}^{M} IDF(t_i) · TF(t_i, d) · (k_1 + 1) / ( TF(t_i, d) + k_1 · (1 − b + b · LEN(d)/avdl) ),    (1.2)

where TF(t, d) is the term frequency of t in document d, LEN(d) is the length (number of words) of document d, and avdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, and IDF(t) is the IDF weight of the term t, computed by using (1.1), for example.
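The instantiation in Eq. (1.2) can be sketched directly. The parameter values k_1 = 1.2 and b = 0.75 are common defaults assumed for illustration; the text only states that they are free parameters, and the toy collection is hypothetical.

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """BM25 as in Eq. (1.2), using the IDF of Eq. (1.1).

    k1 and b are free parameters; the defaults here are common
    choices assumed for this sketch.
    """
    N = len(docs)
    avdl = sum(len(d) for d in docs) / N  # average document length
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in docs if t in d)  # n(t): docs containing t
        if n_t == 0:
            continue
        idf = math.log(N / n_t)               # Eq. (1.1)
        tf = doc.count(t)                     # raw term frequency TF(t, d)
        denom = tf + k1 * (1 - b + b * len(doc) / avdl)
        score += idf * tf * (k1 + 1) / denom
    return score

# Hypothetical toy collection: each document is a list of tokens.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "cat"]]
scores = [bm25_score(["cat"], d, docs) for d in docs]
```

Note how the length normalization in the denominator dampens the contribution of repeated terms: the score grows with TF(t, d) but saturates, rather than increasing linearly as in plain TF weighting.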

LMIR [57] is an application of the statistical language model to information retrieval. A statistical language model assigns a probability to a sequence of terms. When used in information retrieval, a language model is associated with a document.
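The document-as-language-model idea can be sketched with a query-likelihood score: each document's unigram model assigns a probability to the query. The unigram assumption and the Jelinek-Mercer smoothing against a collection model are illustrative choices made for this sketch, not details given in the text.

```python
import math

def query_likelihood(query_terms, doc, docs, lam=0.5):
    """log P(query | document's unigram language model).

    Smoothing mixes the document model with the collection model
    (Jelinek-Mercer, weight lam) -- an assumption for this sketch,
    made so that unseen query terms do not zero out the score.
    """
    collection = [t for d in docs for t in d]
    score = 0.0
    for t in query_terms:
        p_doc = doc.count(t) / len(doc)            # P(t | document)
        p_coll = collection.count(t) / len(collection)  # P(t | collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        score += math.log(p)
    return score

# Hypothetical toy collection: each document is a list of tokens.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]
scores = [query_likelihood(["cat", "sat"], d, docs) for d in docs]
```

Documents whose language model makes the query more probable rank higher; here the first document, which actually contains "cat", outranks the second.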

^7 Note that there are many different definitions of TF and IDF in the literature. Some are purely based on the frequency and the others include smoothing or normalization [70]. Here we just give some simple examples to illustrate the main idea.

^8 The name of the actual model is BM25. In the literature, however, it is usually referred to as "Okapi BM25", since the Okapi system was the first system to implement this model.