Chapter 11
What Cannot Be Measured Cannot Be
Controlled: Gauging Success with A/B Tests
Abstract The robust measurement of the efficiency of recommendation algorithms
is an extremely important factor in the development of recommendation engines. We
provide some useful methodological remarks on this topic in this chapter, even though it is
not directly connected to the problem of adaptive learning. We further propose a
straightforward algorithm to calculate confidence intervals for REs. At the end, we
discuss Simpson's paradox, which illustrates the importance of constant environmental
conditions for testing.
The use of A/B tests to assess the efficiency of recommendation algorithms is on
the increase. Here a proportion of all episodes (generally web sessions) is
randomly assigned to the recommendation algorithm group (referred to as the
“recommendation group”), and the remaining episodes serve as a control
group. Depending on the specific objectives, the control group may be empty
(i.e., displaying no recommendations) or may be assigned to a different
recommendation algorithm. In the group assignment of episodes, there is normally a
fixed ratio between the number of episodes in each group, e.g., 50:50 or 90:10. We
call this ratio the episode quotient q.
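The random assignment with a fixed episode quotient can be sketched as follows. This is a minimal illustration, not taken from the chapter itself; the function name and the 90:10 split are assumptions for the example.

```python
import random

def assign_group(q_rec: float = 0.9) -> str:
    """Randomly assign an episode (e.g., a web session) to the
    recommendation group with probability q_rec, otherwise to the
    control group. A 90:10 episode quotient corresponds to q_rec = 0.9."""
    return "recommendation" if random.random() < q_rec else "control"

# Simulate 100,000 episode assignments with a 90:10 quotient.
random.seed(42)
counts = {"recommendation": 0, "control": 0}
for _ in range(100_000):
    counts[assign_group(0.9)] += 1
```

Over many episodes, the realized group sizes converge to the configured ratio, which is what makes the later scaling of group indicators by the quotient meaningful.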
Along with the reward r, other relevant statistical characteristics can be
measured in each group. In the case of web shops, these could be the number of
clicks, shopping baskets, orders, purchased products, and, in particular, sales.
Scaling these figures by the episode quotient then yields, for all indicators, the
percentage efficiency of the recommendation algorithm relative to the control
group.
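The scaling step described above can be sketched as follows: each indicator is normalized per episode (which is equivalent to scaling by the episode quotient), and the two groups are then compared. The function name and the example figures are illustrative assumptions, not data from the chapter.

```python
def percentage_efficiency(metric_rec: float, metric_ctrl: float,
                          n_rec: int, n_ctrl: int) -> float:
    """Compare a per-group indicator (e.g., sales or orders) after
    normalizing by group size, so that unequal splits such as 90:10
    are put on an equal footing. Returns the percentage uplift of the
    recommendation group over the control group."""
    rate_rec = metric_rec / n_rec      # indicator per episode, rec. group
    rate_ctrl = metric_ctrl / n_ctrl   # indicator per episode, control group
    return 100.0 * (rate_rec - rate_ctrl) / rate_ctrl

# Hypothetical figures for a 90:10 split: 90,000 recommendation
# episodes with 4,950 orders vs. 10,000 control episodes with 500 orders.
uplift = percentage_efficiency(4950, 500, 90_000, 10_000)
# rates are 0.055 vs. 0.05 orders per episode, i.e., a 10 % uplift
```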
The use of A/B tests to determine recommendation quality is widely accepted
and meets generally recognized statistical and scientific standards. However, their
correct implementation and evaluation require compliance with certain criteria,
which we will look at more closely below.