and we obtain

$$
y = \begin{pmatrix} \tfrac{1}{3} \\[4pt] \tfrac{2}{3} \end{pmatrix}.
$$
In words: if a user in state 1 dwells on it with 40 % probability and switches to
state 2 with 60 % probability and, conversely, remains in state 2 with 70 % proba-
bility and moves to state 1 with 30 % probability, then, in a session of infinite
length, he or she will be found in state 1 in 33 % and in state 2 in 67 % of the trajectory.
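This can be checked numerically. The following sketch (a minimal illustration using NumPy, which is our choice of tool here and not an implementation from the text) computes the stationary distribution of the example as the left eigenvector of the transition matrix for eigenvalue 1:

```python
import numpy as np

# Transition matrix of the two-state example:
# from state 1: stay with 0.4, switch to state 2 with 0.6
# from state 2: move to state 1 with 0.3, stay with 0.7
P = np.array([[0.4, 0.6],
              [0.3, 0.7]])

# The stationary distribution y solves y P = y with sum(y) = 1,
# i.e. y is the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
y = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
y = y / y.sum()

print(y)  # approximately [0.3333 0.6667], i.e. (1/3, 2/3)
```

Equivalently, repeatedly multiplying an arbitrary start distribution by P converges to the same vector, which is exactly the infinite-session interpretation given above.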
3.9.4 On the Convergence and Implementation of RL Methods
In this section, we will investigate the convergence of the two most important
methods that we have introduced so far: the policy iteration from Sect. 3.6 and the
temporal-difference learning from Sect. 3.8. So the question is: do these methods
converge to the optimal policy, and under which assumptions? The question of
convergence speed will not be addressed here.
The question of convergence is of far more than theoretical interest: for the
solution of our practical problem, namely the computation of recommendations,
we want to be sure that a solution exists at all.
We start with the policy iteration. In Sect. 3.5, we mentioned that by selecting a
policy π for the Markov decision process, we obtain a Markov chain. Indeed, by
fixing the decisions in the form of the policy, there remains nothing more to
decide in the decision process; all that is left to model is the Markov chain.
The Bellman equation (3.7) can be written compactly in vector notation as

$$
v^{\pi} = r^{\pi} + \gamma P^{\pi} v^{\pi},
\tag{3.15}
$$
where the vectors of the state values $v^{\pi}$ and rewards $r^{\pi}$ are defined as follows:

$$
v^{\pi} = \left[\, v^{\pi}(s_1)\;\; v^{\pi}(s_2)\;\; \ldots\;\; v^{\pi}(s_N) \,\right]^{T}, \qquad
r^{\pi} = \left[\, r^{\pi}(s_1)\;\; r^{\pi}(s_2)\;\; \ldots\;\; r^{\pi}(s_N) \,\right]^{T},
\tag{3.16}
$$

with

$$
r^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} p^{a}_{ss'} \, r^{a}_{ss'},
$$

and $P^{\pi}$ is the matrix of the transition probabilities $p_{ss'}$ (3.3).
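To make the vector notation concrete, here is a small policy-evaluation sketch. The two-state, two-action MDP and the policy below are invented purely for illustration; what the sketch shows is how $P^{\pi}$ and $r^{\pi}$ arise from $p^{a}_{ss'}$, $r^{a}_{ss'}$, and $\pi(s,a)$, and that (3.15) is just a linear system: since $P^{\pi}$ is a stochastic matrix, $I - \gamma P^{\pi}$ is invertible for $\gamma < 1$, so a unique solution $v^{\pi}$ always exists.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, purely for illustration:
# p[a, s, s2] = p^a_{ss'}   (transition probabilities, cf. (3.3))
# r[a, s, s2] = r^a_{ss'}   (expected rewards)
# pi[s, a]    = pi(s, a)    (probability of choosing action a in state s)
gamma = 0.9
p = np.array([[[0.4, 0.6], [0.3, 0.7]],     # action 0
              [[0.8, 0.2], [0.5, 0.5]]])    # action 1
r = np.array([[[1.0, 0.0], [0.0, 2.0]],     # action 0
              [[0.5, 0.5], [1.0, 1.0]]])    # action 1
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# Induced Markov chain and reward vector of the policy pi:
# P^pi_{ss'} = sum_a pi(s,a) p^a_{ss'}
# r^pi(s)    = sum_a pi(s,a) sum_{s'} p^a_{ss'} r^a_{ss'}
P_pi = np.einsum('sa,ast->st', pi, p)
r_pi = np.einsum('sa,ast,ast->s', pi, p, r)

# (3.15) is a linear system; for gamma < 1 it has the unique solution
# v^pi = (I - gamma P^pi)^(-1) r^pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)
```

Solving this linear system directly is one way to carry out the policy evaluation step of the policy iteration.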
Similarly, the Bellman equation for the action-value function (3.6) can be
reformulated as