Chapter 11
What Cannot Be Measured Cannot Be
Controlled: Gauging Success with A/B Tests
Abstract The robust measurement of the efficiency of recommendation algorithms
is an extremely important factor in the development of recommendation engines. We
provide some useful methodological remarks on this topic in this chapter, even though it is
not directly connected to the problem of adaptive learning. We further propose a
straightforward algorithm to calculate confidence intervals for REs. At the end, we
discuss Simpson's paradox, which illustrates the importance of constant environmental
conditions for testing.
The use of A/B tests to assess the efficiency of recommendation algorithms is on
the increase. Here a proportion of all episodes (generally web sessions) is
randomly assigned to the recommendation algorithm group (referred to as the
“recommendation group”), and the remaining episodes serve as a control
group. Depending on the specific objectives, the control group may be empty
(i.e., displaying no recommendations) or may be assigned to a different
recommendation algorithm. In the group assignment of episodes, there is normally a
fixed ratio between the number of episodes in each group, e.g., 50:50 or 90:10. We
call this ratio the episode quotient q.
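The random assignment with a fixed episode quotient can be sketched as follows. This is a minimal illustration, not taken from the chapter itself; the function name and the 90:10 split are assumptions for the example.

```python
import random

def assign_group(q_rec: float = 0.9) -> str:
    """Randomly assign an episode (e.g., a web session) to the
    recommendation group with probability q_rec, otherwise to the
    control group. A 90:10 episode quotient corresponds to q_rec = 0.9."""
    return "recommendation" if random.random() < q_rec else "control"

# Simulate 100,000 episode assignments with a 90:10 quotient.
random.seed(42)
counts = {"recommendation": 0, "control": 0}
for _ in range(100_000):
    counts[assign_group(0.9)] += 1
```

Over many episodes, the realized group sizes converge to the configured ratio, which is what makes the later scaling of group indicators by the quotient meaningful.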
Along with the reward r, other relevant statistical characteristics can be
measured in each group. In the case of web shops, these could be the number of
clicks, shopping baskets, orders, purchased products, and, in particular, sales.
Scaling these figures by the episode quotient then yields, for all indicators, the
percentage efficiency of the recommendation algorithm relative to the control
group.
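The scaling step described above can be sketched as follows: each indicator is normalized per episode (which is equivalent to scaling by the episode quotient), and the two groups are then compared. The function name and the example figures are illustrative assumptions, not data from the chapter.

```python
def percentage_efficiency(metric_rec: float, metric_ctrl: float,
                          n_rec: int, n_ctrl: int) -> float:
    """Compare a per-group indicator (e.g., sales or orders) after
    normalizing by group size, so that unequal splits such as 90:10
    are put on an equal footing. Returns the percentage uplift of the
    recommendation group over the control group."""
    rate_rec = metric_rec / n_rec      # indicator per episode, rec. group
    rate_ctrl = metric_ctrl / n_ctrl   # indicator per episode, control group
    return 100.0 * (rate_rec - rate_ctrl) / rate_ctrl

# Hypothetical figures for a 90:10 split: 90,000 recommendation
# episodes with 4,950 orders vs. 10,000 control episodes with 500 orders.
uplift = percentage_efficiency(4950, 500, 90_000, 10_000)
# rates are 0.055 vs. 0.05 orders per episode, i.e., a 10 % uplift
```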
The use of A/B tests to determine recommendation quality is widely accepted
and meets generally recognized statistical and scientific standards. However, their
correct implementation and evaluation require compliance with certain criteria,
which we will look at more closely below.