In terms of “Recall” (upper bounds of recall), shown in Tables 4 and 5, one of our
goals was to increase the Czech recall, which we believe we have accomplished. As with
precision, the metrics based on Tf-Idf and Phi-Square (Table 4) achieve better recall
values than the Rvar- and MI-based metrics (Table 5). We present “recall” values for the
top 20 ranked relevant terms, as these values are higher than those for the 5 or 10 best
ranked terms. The recall values obtained for the Rvar- and MI-derived metrics (Table 5)
are much lower than those obtained for the Tf-Idf- and Phi-Square-derived metrics,
because the Rvar- and MI-derived metrics tend to select rare terms that characterize
very specific subject matters of documents.
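As a concrete illustration of how the figures in Tables 3 to 5 are computed, the sketch below implements precision and recall at the k best-ranked terms. It is a minimal sketch with hypothetical function and variable names, not the code used in our experiments.

# Minimal sketch (hypothetical names, not the paper's code): precision and
# "recall" at the k best-ranked candidate terms, as reported in Tables 3-5.
def precision_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the k top-ranked terms that are relevant."""
    top = ranked_terms[:k]
    return sum(1 for t in top if t in relevant_terms) / k

def recall_at_k(ranked_terms, relevant_terms, k):
    """Fraction of the relevant terms found among the k top-ranked terms."""
    top = ranked_terms[:k]
    return sum(1 for t in top if t in relevant_terms) / len(relevant_terms)

# Example: 2 of the 5 top-ranked terms are relevant, out of 4 relevant terms,
# so precision(5) = 0.40 and recall(5) = 0.50.
ranked = ["alpha", "beta", "gamma", "delta", "epsilon"]
relevant = {"alpha", "delta", "zeta", "eta"}
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 0.5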
Table 3. Precision values for the 5, 10 and 20 best terms using the Rvar and MI best metrics, and average for each threshold

Czech
Metrics     Prec. (5)   Prec. (10)   Prec. (20)
LBM Rvar    0.50        0.39         0.27
LM Rvar     0.45        0.31         0.22
LBM MI      0.40        0.40         0.26
LM MI       0.45        0.31         0.22
Average     0.45        0.35         0.24

English
Metrics     Prec. (5)   Prec. (10)   Prec. (20)
LBM Rvar    0.52        0.43         0.40
LM Rvar     0.47        0.42         0.35
LBM MI      0.46        0.49         0.43
LM MI       0.47        0.42         0.34
Average     0.48        0.44         0.38

Portuguese
Metrics     Prec. (5)   Prec. (10)   Prec. (20)
LBM Rvar    0.52        0.48         0.41
LM Rvar     0.46        0.36         0.35
LBM MI      0.52        0.48         0.43
LM MI       0.42        0.35         0.33
Average     0.48        0.42         0.38
Table 4. “Recall” values for the threshold of 20 best terms for Tf-Idf and Phi-Square based metrics, and average recall

Metrics      Czech Rec. (20)   English Rec. (20)   Portuguese Rec. (20)
Tf-Idf       0.68              0.43                0.48
L Tf-Idf     0.56              0.48                0.46
LM Tf-Idf    0.52              0.43                0.44
LB Tf-Idf    0.60              0.38                0.37
LBM Tf-Idf   0.54              0.35                0.40
ϕ²           0.50              0.44                0.48
L ϕ²         0.50              0.41                0.36
LM ϕ²        0.51              0.43                0.37
LB ϕ²        0.40              0.37                0.33
LBM ϕ²       0.43              0.41                0.35
Average      0.54              0.41                0.40
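For reference, the sketch below gives standard, textbook-style definitions of the two base scores behind Table 4, Tf-Idf and Phi-Square (ϕ²), the latter computed from a 2×2 term/document contingency table. This is an assumption about their generic form, not our exact formulation, and the L, LB, LM and LBM variants are not reproduced here.

# Minimal sketch (standard definitions assumed; the paper's exact formulas and
# its L, LB, LM, LBM variants are not reproduced here).
import math

def tf_idf(term, docs, i):
    """Tf-Idf of `term` in document docs[i]; each document is a token list."""
    tf = docs[i].count(term) / len(docs[i])
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

def phi_square(term, docs, i):
    """Phi-Square of `term` for document docs[i], from the 2x2 contingency
    table of token counts: term vs. other terms, inside vs. outside docs[i]."""
    rest = [t for j, d in enumerate(docs) if j != i for t in d]
    a = docs[i].count(term)      # term occurrences inside the document
    b = rest.count(term)         # term occurrences in the rest of the corpus
    c = len(docs[i]) - a         # other tokens inside the document
    d = len(rest) - b            # other tokens in the rest of the corpus
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a * d - b * c) ** 2 / denom if denom else 0.0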