probability $p_{ss_a}$, the higher the conditional probability $p^a_{ss_a}$. The transition probabilities $p^a_{ss'}$ for all other products $s'$ are, conversely, influenced negatively by delivering $a$, since (3.2) applies. Of course, our probability property is of a somewhat abstract nature: because the equation system is strongly overdetermined, $c$ and $d$ cannot be uniquely determined in general. Nevertheless, it is helpful for qualitative discussion.
Thus (3.5) takes the following form:

\[
q^\pi(s,a) = p^a_{ss_a} r_{ss_a} + \sum_{s' \neq s_a} p^a_{ss'} r_{ss'} = d\, p_{ss_a} r_{ss_a} + \sum_{s' \neq s_a} c\, p_{ss'} r_{ss'}
\]

and yields

\[
q^\pi(s,a) - q^\pi(s,b) = (d - c)\left(p_{ss_a} r_{ss_a} - p_{ss_b} r_{ss_b}\right) > 0 \iff p_{ss_a} r_{ss_a} > p_{ss_b} r_{ss_b}.
\]
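The step from the expanded sum to the difference of action values can be made explicit. The derivation below is a sketch that introduces the shorthand $T_s = \sum_{s'} p_{ss'} r_{ss'}$ for the total expected reward out of state $s$ without recommendations; this symbol is an assumption of this note and does not appear in the original text.

```latex
\begin{align*}
q^\pi(s,a) &= d\, p_{ss_a} r_{ss_a} + c \sum_{s' \neq s_a} p_{ss'} r_{ss'} \\
           &= d\, p_{ss_a} r_{ss_a} + c \left( T_s - p_{ss_a} r_{ss_a} \right) \\
           &= (d - c)\, p_{ss_a} r_{ss_a} + c\, T_s .
\end{align*}
```

The term $c\,T_s$ does not depend on the recommendation, so it cancels when two recommendations $a$ and $b$ are compared in the same state. What remains is $(d - c)$ times the difference of the products $p_{ss_a} r_{ss_a}$ and $p_{ss_b} r_{ss_b}$, so the ordering of the action values follows the ordering of these products whenever $d > c$.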
The formula for calculating the action value can be derived immediately from this:

\[
q^P(s,a) = p_{ss_a} r_{ss_a}, \tag{5.1}
\]

which we will refer to as the (simplified) P-Version below. A recommendation is thus strong if it is either frequently clicked on, or carries a high reward, or both.
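To make the ranking argument concrete, the following minimal sketch scores the candidate recommendations of one fixed state $s$ with the simplified P-Version and compares the resulting order with the order under the conditional action value. The product names, probabilities, rewards, and the values of `d` and `c` are invented for illustration, and the function names are hypothetical.

```python
from typing import Dict, List

def p_version_scores(p: Dict[str, float], r: Dict[str, float]) -> Dict[str, float]:
    """Simplified P-Version (5.1) for one fixed state s: q^P(s, a) = p_{s s_a} * r_{s s_a}."""
    return {product: p[product] * r[product] for product in p}

def conditional_action_value(a: str, p: Dict[str, float], r: Dict[str, float],
                             d: float, c: float) -> float:
    """Action value under the probability property: the recommended product
    is scaled by d, all other products by c."""
    boosted = d * p[a] * r[a]
    rest = sum(c * p[other] * r[other] for other in p if other != a)
    return boosted + rest

def rank(scores: Dict[str, float]) -> List[str]:
    """Products ordered from strongest to weakest recommendation."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical unconditional transition probabilities and rewards out of state s.
p = {"A": 0.10, "B": 0.04, "C": 0.02}
r = {"A": 5.0, "B": 20.0, "C": 30.0}
d, c = 1.5, 0.9  # illustrative values; only d > c matters for the ordering

simple = p_version_scores(p, r)
conditional = {a: conditional_action_value(a, p, r, d, c) for a in p}

print(rank(simple))
print(rank(conditional))
```

In this example product B ranks first although A is clicked most often and C carries the highest reward, because B has the largest product of probability and reward. Both rankings agree, as the derivation predicts.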
Approach (5.1) may now be expanded for the case $\gamma > 0$ in accordance with (3.6), whereupon we obtain the full P-Version:

\[
q^P(s,a) = p_{ss_a} r_{ss_a} + \gamma\, p_{ss_a} \sum_{a'} \pi(s_a, a')\, q^P(s_a, a'). \tag{5.2}
\]
We can calculate (5.1) and (5.2) either offline, or (5.1) directly online, or (5.2) online using ADP methods such as Algorithm 3.3.
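As an illustration of the offline route, the sketch below solves the full P-Version for a fixed policy by plain fixed-point iteration. This is a stand-in chosen for brevity and is not Algorithm 3.3. It assumes that states and recommended products share one name space, so that recommending `a` in state `s` leads to state `a` if accepted, and that the tables `p`, `r`, and `pi` are given as nested dictionaries. The function name, the table layout, and all numbers are hypothetical.

```python
from typing import Dict

Table = Dict[str, Dict[str, float]]

def full_p_version(p: Table, r: Table, pi: Table, gamma: float,
                   tol: float = 1e-10, max_sweeps: int = 10_000) -> Table:
    """Iterate q(s,a) <- p[s][a] * (r[s][a] + gamma * sum_a' pi[a][a'] * q[a][a'])
    until the largest change in one sweep falls below tol."""
    q: Table = {s: {a: 0.0 for a in p[s]} for s in p}
    for _ in range(max_sweeps):
        new_q: Table = {}
        delta = 0.0
        for s in p:
            new_q[s] = {}
            for a in p[s]:
                # Expected follow-up value in the successor state s_a under pi;
                # products without outgoing transitions contribute zero.
                follow_up = sum(pi[a][a2] * q[a][a2] for a2 in q.get(a, {}))
                new_q[s][a] = p[s][a] * (r[s][a] + gamma * follow_up)
                delta = max(delta, abs(new_q[s][a] - q[s][a]))
        q = new_q
        if delta < tol:
            break
    return q

# Hypothetical two-product shop: each product has one candidate recommendation.
p = {"A": {"B": 0.2}, "B": {"A": 0.5}}
r = {"A": {"B": 10.0}, "B": {"A": 4.0}}
pi = {"A": {"B": 1.0}, "B": {"A": 1.0}}

q_full = full_p_version(p, r, pi, gamma=0.5)
q_simple = full_p_version(p, r, pi, gamma=0.0)  # reduces to p * r, i.e. (5.1)
print(q_full["A"]["B"], q_simple["A"]["B"])
```

Because every update multiplies the follow-up value by $\gamma\, p_{ss_a} \le \gamma < 1$, the iteration is a contraction and converges for any discount rate below one.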
Alternatively, for the model-free case, we can very easily apply the TD-Version
in a similar way, although we have to employ a few empirical tricks to overcome
the problem of multiple recommendations. In practice, the unconditional approach
works quite successfully; the P-Version works better than the TD-Version.
Example 5.1 We now illustrate the results of the unconditional approach by means of a practical example. Here, we employ the online verification methods described in Sect. 4.4. We forgo the chain property, i.e., we set $\gamma = 0$. Thus, we use the simple P-Version according to (5.1) with an adaptive update of the transition probabilities $p_{ss_a}$ and rewards $r_{ss_a}$. To observe unbiased user behavior, only transactions of sessions belonging to the control group have been included in the analysis.
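The text does not spell out the adaptive update rule, so the following is only one possible sketch of such a scheme. It assumes that transition probabilities are estimated as relative frequencies and rewards as running means, both updated one observed transition at a time. The class `PVersionEstimator`, its methods, and the sample transactions are hypothetical.

```python
from collections import defaultdict
from typing import List

class PVersionEstimator:
    """Incremental estimates of p_{ss'} and r_{ss'} from observed transitions,
    scored with the simplified P-Version q^P(s, a) = p_{s s_a} * r_{s s_a}."""

    def __init__(self) -> None:
        self.visits = defaultdict(int)         # observed steps out of state s
        self.transitions = defaultdict(int)    # observed steps s -> s'
        self.reward_mean = defaultdict(float)  # running mean reward on s -> s'

    def observe(self, s: str, s_next: str, reward: float) -> None:
        """Update all estimates with one transition from a control-group session."""
        self.visits[s] += 1
        key = (s, s_next)
        self.transitions[key] += 1
        n = self.transitions[key]
        self.reward_mean[key] += (reward - self.reward_mean[key]) / n

    def p(self, s: str, s_next: str) -> float:
        """Relative frequency of s -> s_next among all steps out of s."""
        n = self.visits.get(s, 0)
        return self.transitions.get((s, s_next), 0) / n if n else 0.0

    def q(self, s: str, a: str) -> float:
        return self.p(s, a) * self.reward_mean.get((s, a), 0.0)

    def recommend(self, s: str, k: int = 1) -> List[str]:
        """Top-k products seen after s, ordered by their P-Version score."""
        candidates = [t for (u, t) in self.transitions if u == s]
        return sorted(candidates, key=lambda t: self.q(s, t), reverse=True)[:k]

est = PVersionEstimator()
for s, s_next, reward in [("A", "B", 10.0), ("A", "B", 20.0),
                          ("A", "C", 100.0), ("A", "D", 0.0)]:
    est.observe(s, s_next, reward)

print(est.recommend("A", k=2))
```

A constant step size instead of the `1 / n` running mean would weight recent transactions more heavily, which may be preferable when user behavior drifts over time.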