Table 10. Evaluation of the Euclidean distance (TREC 2001, 561 pairs in total;
overall quality Q = D rate / E rate; T: threshold)

T    |        AP          |        RP          |       NDCG         |        P10
     | D rate E rate   Q  | D rate E rate   Q  | D rate E rate   Q  | D rate E rate   Q
6%   | 78.4%  33.0%  2.4  | 75.4%  29.0%  2.6  | 66.1%  28.1%  2.4  | 73.3%  25.1%  2.9
9%   | 67.0%  30.2%  2.2  | 59.4%  25.6%  2.3  | 50.1%  23.1%  2.2  | 64.3%  23.4%  2.7
12%  | 56.1%  27.9%  2.0  | 51.1%  23.0%  2.2  | 36.5%  18.4%  2.0  | 56.0%  21.4%  2.6
15%  | 46.7%  24.9%  1.9  | 42.8%  20.5%  2.1  | 27.3%  14.3%  1.9  | 49.4%  18.7%  2.6
18%  | 37.8%  21.9%  1.7  | 33.2%  17.7%  1.9  | 20.3%  11.3%  1.8  | 43.5%  17.3%  2.5
21%  | 30.1%  19.1%  1.6  | 26.6%  15.4%  1.7  | 14.8%   8.7%  1.7  | 38.5%  15.3%  2.5
24%  | 23.9%  16.8%  1.4  | 20.7%  13.5%  1.5  | 10.7%   7.8%  1.4  | 35.1%  14.5%  2.4
27%  | 20.0%  15.6%  1.3  | 14.8%  12.7%  1.2  |  8.7%   7.8%  1.1  | 29.1%  13.1%  2.2
30%  | 16.0%  14.1%  1.1  | 11.2%  11.6%  1.0  |  6.0%   7.2%  0.8  | 23.5%  12.2%  1.9
33%  | 12.1%  12.3%  1.0  |  8.2%  11.5%  0.7  |  4.5%   6.5%  0.7  | 18.9%  11.1%  1.7
as good as the cubic model in both year groups. Therefore, the "overall quality"
of the cubic model is not as good as that of the linear model.
Now let us look at the four ranking-based metrics. In terms of differentiation
rate, P10 performed best and AP second best in both year groups. However, for
error rate, AP was the worst in both year groups (see Figure 1). That is why its
"overall performance" is the worst among the four metrics. On the other hand,
NDCG has the lowest differentiation rate and error rate in both year groups, and
has the best "overall quality". AP is in second place, very close to NDCG in
"overall quality".
If we consider the "overall quality" of all the metrics, then the averages are:
ED(L): 7.1; NDCG: 4.1; ED(C): 3.8; P10: 3.7; RP: 3.4; AP: 2.9. The Euclidean
distance with the linear model is the best, with a much higher average score
than all the others. It is quite surprising to see that average precision (AP)
is the worst, with an average score of 2.9. In the information retrieval
community, AP is commonly used and regarded as a very good system-oriented
metric for retrieval evaluation [2,8].
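The "overall quality" figures in Table 10 follow directly from the definition Q = D rate / E rate. The snippet below recomputes Q for the AP column of Table 10 and averages it over the listed thresholds; note that the averages quoted above (e.g., AP: 2.9) are taken over more data than this single table, so the single-table average differs.

```python
# Recompute Q = D rate / E rate for the AP column of Table 10 (TREC 2001),
# then average over the thresholds. Values are transcribed from the table.

ap_rows = [  # (threshold, D rate, E rate)
    (0.06, 0.784, 0.330), (0.09, 0.670, 0.302), (0.12, 0.561, 0.279),
    (0.15, 0.467, 0.249), (0.18, 0.378, 0.219), (0.21, 0.301, 0.191),
    (0.24, 0.239, 0.168), (0.27, 0.200, 0.156), (0.30, 0.160, 0.141),
    (0.33, 0.121, 0.123),
]

q_values = [d / e for _, d, e in ap_rows]    # per-threshold overall quality
avg_q = sum(q_values) / len(q_values)        # average over these thresholds

# Rounded to one decimal place, q_values reproduces the Q column of the
# AP group in Table 10: 2.4, 2.2, 2.0, 1.9, 1.7, 1.6, 1.4, 1.3, 1.1, 1.0.
```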
3.3 Experiment 3
The above experiment should be reliable for comparing the two variations of the
Euclidean distance, or for comparing the ranking-based metrics among
themselves. However, since the threshold settings for performance difference
differ between the Euclidean distance and the ranking-based metrics, the
conclusions may not be very convincing for a comparison of the Euclidean
distance with the ranking-based metrics. In order to evaluate all these metrics
in a more comparable style, we carried out another experiment. This time we set
up a fixed group of differentiation rates (0.2, 0.25, ..., 0.6) and then found
the corresponding threshold for each of them. Note that the differentiation
rate decreases monotonically as the threshold increases. Using the threshold
obtained for a given differentiation rate, we find its corresponding error
rate. Figures 2-3 show the results.
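The procedure above can be sketched as follows. Because the differentiation rate decreases monotonically with the threshold, the curve can be inverted to find the threshold matching each target rate, and the error rate is then read off at that threshold. The curve values below are hypothetical sample data, not the paper's, and linear interpolation between sampled points is just one reasonable choice; the paper does not specify its inversion method.

```python
# Sketch of the Experiment 3 procedure: invert a monotonically decreasing
# differentiation-rate curve to find the threshold for each target rate,
# then look up the error rate at that threshold.

def interpolate(points, x):
    """Piecewise-linear interpolation; points is a list of (x, y) pairs
    sorted by increasing x. Returns the interpolated y at the query x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            frac = (x - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
    raise ValueError("query outside sampled range")

# Hypothetical samples: threshold -> differentiation rate (decreasing)
diff_curve = [(0.06, 0.60), (0.12, 0.45), (0.18, 0.30), (0.24, 0.20)]
# Hypothetical samples: threshold -> error rate (decreasing)
err_curve = [(0.06, 0.30), (0.12, 0.22), (0.18, 0.15), (0.24, 0.10)]

def threshold_for_rate(target):
    """Invert the decreasing diff-rate curve by interpolating on
    (diff_rate, threshold) pairs, sorted by increasing diff_rate."""
    inverted = sorted((d, t) for t, d in diff_curve)
    return interpolate(inverted, target)

# For each target differentiation rate, find the threshold and its error rate.
targets = [0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6]
results = [(x, threshold_for_rate(x),
            interpolate(err_curve, threshold_for_rate(x))) for x in targets]
```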