Database Reference
NDCG requires a parameter c, which we set to 2 as in [4]. For AP, RP,
and P10, we use the expanded forms [12] described above for three-level graded
relevance judgments. Two variations of the Euclidean distance were involved: the
linear model and the cubic model for estimating the relation between rank and
relevance in resultant lists. Note that the raw scores of retrieved documents from
information retrieval systems are not used in either variation. From this perspective,
both variations behave somewhat like ranking-based metrics.
We applied the cubic model to all the selected runs in each year group and
obtained the values of the four parameters by regression analysis. For TREC 9,
the values are a_0 = 0.2474, a_1 = -0.0639, a_2 = 0.0036, and a_3 = 0.0001.
For TREC 2001, the values are a_0 = 0.3267, a_1 = -0.0761,
a_2 = 0.0032, and a_3 = 0.0002. Thus, we can assign proper relevance scores to
documents at different ranks. Note, however, that these parameter values are fitted
over all the runs in a year group, so they are only reasonably good for estimating
the rank-relevance relation of each individual result.
We avoided using the best possible parameter values for every individual result,
so as to be fair to the other metrics.
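To make the use of these parameters concrete, the following is a minimal sketch of how a cubic model can turn a rank into an estimated relevance score. The coefficients are the ones reported above. The function name `cubic_score` and the `transform` argument are hypothetical, and the choice of ln(rank) as the polynomial variable is an assumption: this passage does not state which variable the polynomial is taken in.

```python
import math

# Coefficients (a_0, a_1, a_2, a_3) reported in the text for the cubic model.
CUBIC_COEFFS = {
    "TREC 9": (0.2474, -0.0639, 0.0036, 0.0001),
    "TREC 2001": (0.3267, -0.0761, 0.0032, 0.0002),
}


def cubic_score(rank, coeffs, transform=math.log):
    """Estimate a relevance score for a document at a 1-based rank.

    ASSUMPTION: the polynomial is evaluated in x = ln(rank). The text does
    not say which variable is used; pass transform=float to use the raw rank.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    a0, a1, a2, a3 = coeffs
    x = transform(rank)
    return a0 + a1 * x + a2 * x ** 2 + a3 * x ** 3


if __name__ == "__main__":
    for year, coeffs in CUBIC_COEFFS.items():
        scores = [round(cubic_score(r, coeffs), 4) for r in (1, 10, 100)]
        print(year, scores)
```

Under the ln(rank) assumption the score at rank 1 equals a_0, since ln(1) = 0, and it decreases over the first ranks.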
3 Experiments
A few different aspects of these metrics are compared through three groups of
experiments. Let us discuss them one by one.
3.1 Experiment 1
First, for all the selected runs in a year group (TREC 9 or TREC 2001), we
evaluated their effectiveness over 50 queries using all six metrics. Then
Pearson's correlation coefficients were calculated for the different rankings of
the information retrieval systems obtained by using different metrics. The
correlation coefficients are shown in Tables 1 and 2, for TREC 9 and TREC 2001,
respectively.
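The correlation step can be sketched as follows, assuming each metric yields one value per run (for example, its average over the 50 queries). The function name `pearson` and the sample values are hypothetical and purely illustrative; they are not taken from Tables 1 and 2.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-run values of two metrics (illustrative only).
    metric_a = [0.41, 0.35, 0.52, 0.29, 0.47]
    metric_b = [0.30, 0.28, 0.44, 0.20, 0.36]
    print(round(pearson(metric_a, metric_b), 3))
```

Applying such a function to every pair of metrics gives a correlation matrix of the kind the tables report.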
In all the cases, the correlations are significant at the 0.01 level (2-tailed).
Tables 1 and 2 show that, generally speaking, there is a strong correlation
between either variation of the Euclidean distance and each of the four
ranking-based metrics. However, the strength of correlation varies across the
two year groups and the two variations of the Euclidean distance. The smallest
is 0.624 (between ED(L) and P10 in TREC 9) and the biggest is 0.981 (between
ED(C) and NDCG in TREC 2001).
We also carried out linear regression analysis on the values of the different
metrics. Tables 3-6 show the coefficients and significance of the analysis. The
Euclidean distance can be well or reasonably well expressed linearly using any
of the four metrics. Among them, NDCG is always the best (R^2 = 0.946, 0.664,
0.799, and 0.962; if R^2 = x, then NDCG can explain 100x% of the
variation in the Euclidean distance and vice versa) and P10 (R^2 = 0.782, 0.389,
0.484, and 0.754) is always the least able to express the Euclidean distance, while
AP and RP are in the second and third places, respectively.
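As an illustration of the regression step, here is a minimal sketch of a least-squares line fit and its R^2 for one predictor. The function names `linear_fit` and `r_squared` and the sample values are hypothetical. With a single predictor, R^2 equals the square of Pearson's correlation coefficient, so it is the same whichever variable is treated as the response, which is why the "vice versa" reading holds.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variance in ys explained by a linear fit on xs."""
    slope, intercept = linear_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("y values must not all be equal")
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical per-run values of two metrics (illustrative only).
    metric_a = [0.41, 0.35, 0.52, 0.29, 0.47]
    metric_b = [0.62, 0.70, 0.48, 0.77, 0.55]
    print(round(r_squared(metric_a, metric_b), 3))
    print(round(r_squared(metric_b, metric_a), 3))  # same value: R^2 is symmetric
```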
 