Table 3. Statistics of the collected sequences

Data set          No. of sequences   No. of distinct questions      %
Earth sciences           61                    90                 28.75
Nutrition                46                    94                 29.56
Homeschooling            56                    84                 43.98
3.3 Evaluation Metrics
Evaluating a recommender system on its prediction power is crucial, but insufficient for deploying a good recommendation engine [17]. Other measures reflect further aspects of recommendation quality; however, not every recommender is expected to perform well on all of them.
Therefore, the evaluation of the LoR model should be based not on prediction performance (accuracy and average log-loss) alone, but also on other metrics that capture desired properties of a learning-oriented recommender within a QA system. Let us briefly define these metrics.
Catalog Coverage. In general, catalog coverage represents the proportion of questions that the recommendation model can recommend. In our case, we define the catalog coverage as the proportion of questions that the model $P$ can recommend with a prediction value higher than a predefined threshold $\tau$.
Overall, all three recommender models introduced in Section 2 can generate recommendations for any user (i.e., full user space coverage) and, eventually, every question can be recommended, since the recommender repeatedly excludes already visited ones. However, as the database approaches exhaustion, the remaining recommendations carry very low prediction values and become unreliable. Therefore, we introduce the prediction threshold $\tau$.
In our evaluation, we generally set $\tau$ to the lowest prediction value among the questions within the sequences used for training. Since the user space coverage is equal for all recommender models, we will further refer to catalog coverage simply as “coverage”.
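As a minimal sketch of this metric, assuming a hypothetical mapping from question ids to the prediction values of the model $P$ (the function and data names below are illustrative, not from the paper):

```python
def catalog_coverage(predictions, tau):
    """Proportion of catalog questions that the model can recommend
    with a prediction value above the threshold tau (hypothetical helper).

    predictions: dict mapping question id -> prediction value of model P
    tau:         prediction threshold; here taken as the lowest prediction
                 value among the questions in the training sequences
    """
    recommendable = sum(1 for value in predictions.values() if value > tau)
    return recommendable / len(predictions)


# Illustrative values only: tau is the minimum prediction over the
# training questions, as described above.
train_values = [0.31, 0.12, 0.58]
tau = min(train_values)                      # tau = 0.12
preds = {"q1": 0.92, "q2": 0.40, "q3": 0.05}
print(catalog_coverage(preds, tau))          # 0.666... (2 of 3 questions)
```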
Diversity. Generally, diversity is defined as the opposite of similarity. Within this context, we define diversity as the average dissimilarity over all question pairs within a recommendation.
Let $s$ be a question sequence context. Then the diversity of $R(s)$ is defined as

\[
\operatorname{div}(R(s)) \;=\; \frac{2}{N\,(N-1)} \sum_{\substack{(q_i,\,q_j)\,\in\, R(s)\\ i<j}} \bigl[\,1 - \operatorname{sim}_q(q_i, q_j)\,\bigr], \tag{13}
\]

where $N = |R(s)|$ denotes the number of recommended questions and $\operatorname{sim}_q : \mathcal{Q} \times \mathcal{Q} \rightarrow [0,1]$ represents the semantic similarity measure between questions.
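A minimal sketch of Eq. (13), assuming questions are represented as sparse term-weight vectors (dicts) and plugging in plain cosine similarity as $\operatorname{sim}_q$; the Lin concept similarity [14] would slot in the same way. All names and values are illustrative:

```python
import math
from itertools import combinations

def cosine_sim(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def diversity(recommendation, sim_q=cosine_sim):
    """Average pairwise dissimilarity of a recommendation R(s), Eq. (13):
    div = 2 / (N (N-1)) * sum over pairs i<j of [1 - sim_q(q_i, q_j)]."""
    n = len(recommendation)
    if n < 2:
        return 0.0  # diversity is not meaningful for fewer than two questions
    total = sum(1.0 - sim_q(qi, qj)
                for qi, qj in combinations(recommendation, 2))
    return 2.0 * total / (n * (n - 1))

# Example with three toy question vectors.
q1 = {"soil": 0.8, "erosion": 0.6}
q2 = {"soil": 0.7, "water": 0.7}
q3 = {"diet": 1.0}
print(diversity([q1, q2, q3]))  # closer to 1.0 means a more diverse list
```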
During the evaluation, we used simple cosine similarity together with the semantic concept similarity defined by Lin [14]. In order to avoid further