query Q, a result R from an information retrieval system IR is {s_1, s_2, ..., s_n}, where s_i is the score assigned to document d_i by IR. On the other hand, every document d_i has an ideal relevance score o_i. If binary relevance judgment is used, then the ideal score for any judged relevant document is 1 and that for any judged irrelevant document is 0. If 3-graded relevance judgment is used, then 0, 0.5, and 1 can be used as the ideal scores of irrelevant, modestly relevant, and highly relevant documents, respectively. In such a situation, the Euclidean distance between the scores in R and the ideal scores O = {o_1, o_2, ..., o_n} can be calculated by

distance(R, O) = \sqrt{\sum_{i=1}^{n} (s_i - o_i)^2}    (1)
distance(R, O) can be used as a metric to evaluate the effectiveness of R. If all
documents' scores are estimated accurately, then we can expect a low Euclidean
distance value; otherwise, we can expect a high one. The Euclidean distance has
been widely used as a metric in many areas such as data mining, neural networks, etc. However, to our knowledge, it has never been explored in information retrieval. Unlike ranking-based metrics, the Euclidean distance can be regarded as a relevance-score-based metric. It is therefore worth finding out whether the Euclidean distance is a good choice for the evaluation of information retrieval results.
At first glance, one may think that the requirement for using this metric is very demanding, since relevance scores for all the documents in the whole collection need to be provided. This is only true theoretically. In practice, we may use some
reasonable approximation methods. In TREC, only a certain number (say, 1000)
of documents are included in a result for a query. It is not known what the scores
are for those documents that do not occur in the result. Since those documents
are very likely irrelevant, we can reasonably assign a default score of 0 to all of
them. Another situation is: although some retrieval systems provide scores for
all retrieved documents, those scores may be in various ranges (say, one is in the
range of 1 to 1000 and another in the range of 0 to 10) and cannot be used directly
as relevance scores. In such a situation, we may use score normalization methods
to normalize all the scores to the desired range [0,1]. Several such methods have
been investigated before [5,6,11]. Finally, some retrieval systems may not provide
scores at all for the retrieved documents. Then we may assign different scores
to documents at different ranks in a systematic manner. One straightforward
method for this is the linear model: for a ranked list of m documents, the top
ranked document is given a score of 1, the 2nd document in the list is assigned
a score of (m−1)/m, ..., and the last document in the list is assigned a score of 1/m.
This is the method used in Borda voting. It has been used for data fusion [1]
in information retrieval as well. Alternatively, we may use nonlinear models to
estimate scores for documents at different ranks. For example, we may use the
cubic model [9] or other models (e.g., the logistic model [3]) to do this. The cubic
model can be expressed by the following equation
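As a concrete illustration of the two simpler approximations described above, the following Python sketch shows a min-max normalization of raw scores into [0,1] (one common choice, not necessarily the specific methods of [5,6,11]) and the linear, Borda-style assignment of scores by rank; the function names and example values are hypothetical.

def minmax_normalize(raw_scores):
    # Map raw retrieval scores, whatever their original range, into [0, 1].
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:                        # degenerate case: all scores identical
        return [1.0 for _ in raw_scores]
    return [(s - lo) / (hi - lo) for s in raw_scores]

def linear_rank_scores(m):
    # Linear (Borda-style) model: rank 1 gets 1, rank 2 gets (m-1)/m, ..., rank m gets 1/m.
    return [(m - i) / m for i in range(m)]

print(minmax_normalize([1000, 400, 10]))   # e.g. raw scores originally in the range 1..1000
print(linear_rank_scores(4))               # [1.0, 0.75, 0.5, 0.25]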