Recommendations as a Game: Reinforcement Learning for Recommendation Engines - Realtime Data Mining

value of the product s (price, revenue, etc.) as its reward; otherwise, it receives a

small click reward, close to 0. This reflects the primary goal of seeking to maximize

the shopping basket values or the sales/revenue. Note that orders constitute a

delayed reward, since, in most cases, they appear only at the end of a session.

The definition of the correct reward is linked to various refinements that will not be

further explored here.
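
The reward scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the condition for receiving the product value is that the product is ordered, and the names CLICK_REWARD, transition_reward, and session_return, as well as the value 0.01, are hypothetical.

```python
# Hypothetical constant: a small click reward, "close to 0".
CLICK_REWARD = 0.01


def transition_reward(product_value: float, ordered: bool) -> float:
    """Reward for entering a product state.

    Assumption: an order yields the product's value (price, revenue, etc.);
    a mere click yields only the small click reward.
    """
    return product_value if ordered else CLICK_REWARD


def session_return(steps) -> float:
    """Sum of rewards over a session given as (product_value, ordered) pairs."""
    return sum(transition_reward(value, ordered) for value, ordered in steps)


# Two clicks followed by an order at the end of the session (a delayed reward).
session = [(19.99, False), (49.0, False), (49.0, True)]
total = session_return(session)
```

Because the order arrives only in the last step, almost all of the session's return is realized at the end, which is what makes the reward delayed.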

We now come to the statistical characteristics. Let us state our first fundamental

assumption:

Assumption 4.1 (Markov property for REs): In every state s, the optimal

action a, i.e., the best recommendation, depends solely on the current state s,

i.e., the product under consideration.

Of course, this Markov property for REs is satisfied only incompletely, since the

best recommendation also depends on the preceding states of s together with their

transactions. Nevertheless, for the evaluation of a recommendation by the user, the

product currently viewed plays the main role, so the assumption may be considered

reasonable. (There is also compelling empirical evidence on this point, namely,

classic cross-selling, which is described using precisely this form of rules and

whose effectiveness is beyond doubt.)

As a further simplification, let us assume that the reward in the state transition

from s to s′ is independent of the action a:

Assumption 4.2 (Reward property for REs): For each state transition from

s to s′, the obtained reward $r^a_{ss'}$ is independent of the action a.

This means that

$$r^a_{ss'} = r_{ss'}. \tag{4.2}$$

In fact, it can be assumed that the user's decision as to whether or not to place a

product in the shopping basket depends primarily on the product itself and not on

the preceding recommendation. Thus, the estimated reward can be stored as a

property of the rule s → s′.
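
One way to store the estimated reward per rule is sketched below. It is an illustration under stated assumptions: the class RuleRewards and its methods are hypothetical, and the running mean is just one simple choice of estimator, not one prescribed by the text. The recommendation that was shown is accepted as an argument but deliberately ignored, in line with Assumption 4.2.

```python
from collections import defaultdict


class RuleRewards:
    """Hypothetical store of reward estimates keyed by the rule (s, s_next)."""

    def __init__(self):
        self._mean = defaultdict(float)
        self._count = defaultdict(int)

    def observe(self, s, s_next, reward, action=None):
        # `action` is ignored on purpose: by Assumption 4.2 the reward of the
        # transition s -> s_next does not depend on the recommendation shown.
        key = (s, s_next)
        self._count[key] += 1
        # Incremental update of the running mean of observed rewards.
        self._mean[key] += (reward - self._mean[key]) / self._count[key]

    def estimate(self, s, s_next):
        # Rules never observed default to 0.0 (an assumption of this sketch).
        return self._mean.get((s, s_next), 0.0)


rewards = RuleRewards()
rewards.observe("A", "B", 10.0, action="B")
rewards.observe("A", "B", 20.0, action="C")  # different action, same rule
```

Both observations update the same entry, since only the pair of products identifies the rule.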

The action-value function q(s,a) assigns the expected return, i.e., the expected

sales over the remainder of the session, to each product s and to each of its

recommendations a. Technically, q(s,a) can thus also be represented by the rule

$s \to s_a$ from product s to the recommended product $s_a$.
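
The rule representation of q(s,a) can be pictured as a nested lookup table. The sketch below is hypothetical: the product names and values are made up for illustration, and the greedy selection in recommend is one simple way of using the table, not necessarily the policy the authors apply. Note that the lookup uses only the current product, as Assumption 4.1 suggests.

```python
# Hypothetical action values: q[s][s_a] is the expected remaining session
# sales when s_a is recommended on product s. All numbers are invented.
q = {
    "camera": {"tripod": 12.5, "memory_card": 18.0, "bag": 9.0},
    "tripod": {"camera": 30.0},
}


def recommend(q, s, k=1):
    """Return the k recommendations with the highest action value for s."""
    rules = q.get(s, {})  # all rules s -> s_a leaving the current product
    ranked = sorted(rules, key=rules.get, reverse=True)
    return ranked[:k]


top = recommend(q, "camera")
top_two = recommend(q, "camera", k=2)
```

A product without any stored rules simply yields an empty list of recommendations.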

There remains the question of the transition probabilities $p^a_{ss'}$. This is a

complicated subject, which we shall consider in depth in Chap. 5.
