Preliminaries - Recommender Systems and the Social Web

list to the total number of all like statements. Formally, recall is measured as

$$\mathrm{recall} = \frac{|PLS \cap ELS|}{|ELS|} \qquad (2.13)$$

and describes what fraction of the existing like statements were found by the recommender.

In the evaluation procedure, recommendations and the corresponding precision and recall values are

calculated for all users in the data set and then averaged. Note that both precision and recall values have

to be considered simultaneously because improving one usually comes at the cost of the other. Therefore, the

averaged precision and recall values are combined in the F1-score, where

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (2.14)$$
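
The evaluation procedure described above can be sketched in Python as follows. This is a minimal illustration, not the original implementation. It assumes that PLS denotes a user's set of predicted like statements and ELS the set of existing like statements, that precision is defined as |PLS ∩ ELS| / |PLS|, and that an empty denominator yields 0. The function names and the dictionary layout are hypothetical.

```python
def precision_recall(pls, els):
    """Return (precision, recall) for a single user.

    pls: set of predicted like statements (assumed meaning of PLS)
    els: set of existing like statements (assumed meaning of ELS)
    """
    hits = len(pls & els)
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    precision = hits / len(pls) if pls else 0.0
    recall = hits / len(els) if els else 0.0
    return precision, recall


def averaged_f1(predictions, ground_truth):
    """Average precision and recall over all users, then combine them into F1.

    predictions, ground_truth: dicts mapping a user id to a set of items.
    Returns (average precision, average recall, F1 of the averaged values).
    """
    if not ground_truth:
        raise ValueError("at least one user is required")
    pairs = [
        precision_recall(predictions.get(user, set()), els)
        for user, els in ground_truth.items()
    ]
    avg_p = sum(p for p, _ in pairs) / len(pairs)
    avg_r = sum(r for _, r in pairs) / len(pairs)
    denom = avg_p + avg_r
    f1 = 2 * avg_p * avg_r / denom if denom else 0.0
    return avg_p, avg_r, f1
```

Note that the F1-score is computed here from the averaged precision and recall values, as stated in the text, and not averaged per user.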

2.2.2 User studies

As opposed to offline experiments, user studies require much more user effort and time [Wildemuth, 2003].

In a user study, a small group of volunteer test persons is typically recruited for the experiment. The

test persons are expected to interact with a recommender system and to report their experience with

the system. The experiment is conducted either in a laboratory environment or in some other location,

e.g., in a private setting. The participants are monitored and interviewed before, during, or after

the experiment. Usually, the results from a user study are used to test some hypotheses which have

been formulated by the researcher before the experiment was actually designed. When conducting a user

study, the question arises of how many users should be recruited for the experiment. We can

consider the number of recruited users for an experiment large enough when the results are statistically

significant according to a statistical significance test [Demsar, 2006; Shani and Gunawardana, 2011]. The

statistical significance test has to be selected carefully because some tests either lack the statistical

power to detect differences that actually exist or may even report significance where there is none

(false positives); see, for example, [Demsar, 2006]

and [Smucker et al., 2007] for a comparison of statistical significance tests.
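
To illustrate what such a test does, the following Python sketch implements an exact two-sided paired randomization (sign-flip) test. The choice of this test is an assumption made for illustration only; the text does not prescribe a particular test. The function name and the input layout (one score per user and system) are hypothetical. Because all sign assignments are enumerated, the sketch is only practical for small samples.

```python
from itertools import product


def paired_randomization_test(scores_a, scores_b):
    """Exact two-sided paired randomization (sign-flip) test.

    scores_a, scores_b: per-user scores of two systems, in the same user order.
    Returns the p-value: the share of sign assignments whose absolute summed
    difference is at least as large as the observed one.
    Enumerates 2**n assignments, so use it only for small n (roughly n <= 20).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need one score per user")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis the sign of each paired difference is arbitrary.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total
```

With eight users and a constant difference between the two systems, only the two assignments with uniform signs reach the observed value, so the p-value is 2/256. This also shows why very small groups cannot yield small p-values regardless of the effect.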

User studies are more costly than simple offline experiments and harder to conduct, but amongst the

three different evaluation types discussed here, they can perhaps answer the widest range of question

types according to [Shani and Gunawardana, 2011]. In a user study both types of results - quantitative

and qualitative results - can be collected. Quantitative results can be recorded by monitoring the user

behavior. A quantitative result would be, for instance, the time needed by a user to complete a task,

which can be measured implicitly. In [Gedikli et al., 2011b], for example, we measure how different

explanation interfaces can reduce the user's decision-making time (see also Chapters 5 and 6). We use

a direct measurement and compute the time difference for completing the same task of decision making

with and without an explanation facility or across different explanation facilities.
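
A direct time measurement of this kind could be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure of the cited study: the helper names are hypothetical, the decision task is represented by an arbitrary callable, and the two conditions are compared by the difference of their mean completion times.

```python
import time
from statistics import mean


def measure_seconds(task):
    """Run a callable that represents one decision task; return elapsed seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def decision_time_difference(times_without, times_with):
    """Mean decision time without explanations minus mean time with explanations.

    A positive value means that decisions were faster, on average, when an
    explanation facility was available.
    """
    if not times_without or not times_with:
        raise ValueError("each condition needs at least one measurement")
    return mean(times_without) - mean(times_with)
```

The same function can compare two explanation facilities by passing the completion times recorded for each of them.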

On the other hand, qualitative data can also be collected in a user study, e.g., by explicitly asking

users questions about their feelings towards the recommender system. Qualitative data can, for

example, capture a user's opinion about a generated recommendation. In [Lewis, 1995], questionnaires

are proposed which can be used to measure the usability of a recommender system from the user's

perspective. Qualitative data is hard to obtain but necessary to assess the real value of a recommender

system. For instance, a user may not always be satisfied with highly accurate recommendations, e.g.,

when the movie “Terminator II” is recommended to a user who has already watched the first part of the

movie. The probability is high that the user is already aware of the recommended movie. Therefore,

this recommendation would be highly accurate but not very useful; that is, accuracy does not always

correlate with user satisfaction. See, for example, [McNee et al., 2006] or [Cremonesi et al., 2011] for

a broader discussion of this problem. User studies have been shown to be a helpful tool for interpreting

quantitative results such as a recommendation list returned by a recommender system. Note that

user studies represent the only experimental setting in which qualitative data can be collected [Shani and

Gunawardana, 2011].

According to [Greenwald, 1976], there are two principal design types of user studies: between-subjects

and within-subjects experiments. In a user study typically two or more treatments (systems, algorithms,
