Databases Reference
In-Depth Information
list to the total number of all like statements, formally recall is measured as
\[
\text{recall} = \frac{|PLS \cap ELS|}{|ELS|} \tag{2.13}
\]
and describes how many of the existing like statements were found by the recommender.
In the evaluation procedure, recommendations and the corresponding precision and recall values are
calculated for all users in the data set and then averaged. Note that both precision and recall values have
to be considered simultaneously because improving one is usually at the cost of the other. Therefore, the
averaged precision and recall values are combined in the F1-score, where
\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2.14}
\]
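The evaluation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. It assumes that precision is defined analogously to recall as |PLS ∩ ELS| / |PLS|, that PLS and ELS are given as sets of item identifiers per user, and that a measure with an empty denominator is counted as 0. The function names and the toy data are hypothetical.

```python
def precision_recall(predicted, existing):
    """Return (precision, recall) for one user.

    predicted: set of predicted like statements (PLS)
    existing:  set of existing like statements in the test data (ELS)
    """
    hits = len(predicted & existing)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(existing) if existing else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Combine precision and recall as in Equation 2.14."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(users):
    """Average precision and recall over all users, then compute F1.

    users: list of (predicted_set, existing_set) pairs, one per user.
    """
    pairs = [precision_recall(p, e) for p, e in users]
    avg_precision = sum(p for p, _ in pairs) / len(pairs)
    avg_recall = sum(r for _, r in pairs) / len(pairs)
    return avg_precision, avg_recall, f1_score(avg_precision, avg_recall)


# Made-up toy data for two users (illustrative only).
users = [
    ({"a", "b", "c", "d"}, {"a", "b", "e"}),  # 2 hits
    ({"x", "y"}, {"x", "y", "z", "w"}),       # 2 hits
]
avg_p, avg_r, f1 = evaluate(users)
```

Note that the F1-score here is computed from the averaged precision and recall values, as stated in the text, rather than by averaging per-user F1-scores. The two variants generally give different numbers.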
2.2.2 User studies
As opposed to offline experiments, user studies require much more user effort and time [Wildemuth, 2003].
In a user study, typically a small group of volunteer test persons is recruited for the experiment. The
test persons are expected to interact with a recommender system and to report their experience with
the system. The experiment is conducted either in a laboratory environment or in some other location,
e.g., in a private setting. The participants are monitored and interviewed before, during, or after
the experiment. Usually, the results from a user study are used to test some hypotheses which have
been formulated by the researcher before the experiment was actually designed. When conducting a user
study, the question arises of how many users should be recruited for the experiment. We can
consider the number of recruited users large enough when the results are statistically
significant according to a statistical significance test [Demsar, 2006; Shani and Gunawardana, 2011].
The significance test has to be selected carefully: some tests are not powerful enough to detect
existing significant differences, while others may falsely report significance in data where there is
none. See, for example, [Demsar, 2006] and [Smucker et al., 2007] for a comparison of statistical
significance tests.
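To make the idea of a significance test on user-study results concrete, the following sketch implements an exact paired randomization (sign-flip) test in plain Python. It is one possible test among many and is not prescribed by the text. The paired setting (each participant rates both systems), the function name, and the ratings are assumptions made for illustration.

```python
from itertools import product


def paired_randomization_test(scores_a, scores_b):
    """Two-sided exact sign-flip randomization test on paired scores.

    Under the null hypothesis that both systems are interchangeable, the
    sign of each per-participant difference is arbitrary, so all 2^n sign
    patterns are enumerated. Feasible only for small n (roughly n <= 20).
    Returns the p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired per participant")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point noise.
        if permuted >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical satisfaction ratings of six participants for two systems.
ratings_a = [4, 5, 4, 5, 4, 5]
ratings_b = [3, 3, 3, 4, 2, 4]
p_value = paired_randomization_test(ratings_a, ratings_b)
```

With only six participants, the smallest p-value this test can produce is 2/64, which illustrates why the number of recruited users limits what can be shown to be significant.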
User studies are more costly than simple offline experiments and harder to conduct, but amongst the
three different evaluation types discussed here, they can perhaps answer the widest range of question
types according to [Shani and Gunawardana, 2011]. In a user study both types of results - quantitative
and qualitative results - can be collected. Quantitative results can be recorded by monitoring the user
behavior. A quantitative result would be, for instance, the time needed by a user to complete a task,
which can be measured implicitly. In [Gedikli et al., 2011b], for example, we measure how different
explanation interfaces can reduce the user's decision-making time (see also Chapters 5 and 6). We use
a direct measurement and compute the time difference for completing the same task of decision making
with and without an explanation facility or across different explanation facilities.
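As a rough illustration of such a time-difference measurement, the sketch below computes the mean reduction in decision-making time. It assumes a paired setup in which each participant completes the task once without and once with an explanation facility. The function name and the timings are made up and are not taken from [Gedikli et al., 2011b].

```python
from statistics import mean


def mean_time_saving(times_without, times_with):
    """Mean per-participant reduction in decision time, in seconds.

    Both lists must be ordered by participant, so that position i in each
    list refers to the same person (paired measurement, an assumption).
    A positive result means decisions were faster with explanations.
    """
    if len(times_without) != len(times_with):
        raise ValueError("timings must be paired per participant")
    return mean(wo - w for wo, w in zip(times_without, times_with))


# Hypothetical decision times in seconds for four participants.
times_without = [40, 50, 60, 50]
times_with = [30, 45, 50, 35]
saving = mean_time_saving(times_without, times_with)
```

If different participants use the different explanation facilities, the timings are no longer paired, and the means of the two groups would be compared instead.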
On the other hand, qualitative data can also be collected in a user study, e.g., by explicitly asking
users questions about their feelings towards the recommender system. Qualitative data can, for
example, capture a user's opinion about a generated recommendation. In [Lewis, 1995], questionnaires
are proposed which can be used to measure the usability of a recommender system from the user's
perspective. Qualitative data is hard to obtain but necessary to assess the real value of a recommender
system. For instance, a user may not always be satisfied with highly accurate recommendations, e.g.,
when the movie “Terminator II” is recommended to a user who has already watched the first part of the
movie. The probability is high that the user is already aware of the recommended movie. Therefore,
this recommendation would be highly accurate but not very useful; that is, accuracy does not always
correlate with user satisfaction. See, for example, [McNee et al., 2006] or [Cremonesi et al., 2011] for
a broader discussion of this problem. User studies have proven to be a helpful tool for interpreting
quantitative results such as a recommendation list returned by a recommender system. Note that
user studies represent the only experiment setting where qualitative data can be collected [Shani and
Gunawardana, 2011].
According to [Greenwald, 1976], there are essentially two design types of user studies: between-subjects
and within-subjects experiments. In a user study typically two or more treatments (systems, algorithms,
 