Another important approach consists in linking policy evaluation and policy improvement more strongly: instead of carrying out the entire policy evaluation, i.e. computing the state-value function of the fixed policy exactly, before executing the policy improvement step, we carry out the latter after each step of the iterative policy evaluation. The resulting approach is thus computationally more tractable (which does not imply that it is faster or better) and is referred to as value iteration. Value iteration is given as Algorithm 3.2.
Algorithm 3.2: Value iteration
Input: transition probabilities P and rewards R, discount rate γ, small inner bound θ ∈ ℝ, θ > 0
Output: optimal state-value function v*
1: procedure VALUE_ITERATION(v, P, R, γ, θ)
2:   repeat
3:     Δ := 0
4:     for each s ∈ S do
5:       v := v(s)
6:       v(s) := max_a Σ_{s'} p^a_{ss'} [r^a_{ss'} + γ v(s')]
7:       Δ := max(Δ, |v − v(s)|)
8:     end for
9:   until Δ < θ
10:   return v
11: end procedure
12: initialize v(s), π(s) ∀ s ∈ S arbitrarily
13: v := VALUE_ITERATION(v, P, R, γ, θ)
14: return v
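To make the backup in line 6 concrete, the following is a minimal Python sketch of Algorithm 3.2. The nested-dictionary representation of P and R (P[s][a][s2], R[s][a][s2]) and the tiny two-state example at the end are assumptions made purely for illustration; they are not prescribed by the text.

def value_iteration(v, P, R, gamma, theta):
    """v: dict mapping state -> value, P[s][a][s2]: transition probability,
    R[s][a][s2]: expected reward, gamma: discount rate, theta: stopping bound."""
    while True:
        delta = 0.0
        for s in v:                                  # sweep over all states
            old = v[s]
            # Bellman optimality backup: maximum over actions of the expected return
            v[s] = max(
                sum(P[s][a][s2] * (R[s][a][s2] + gamma * v[s2]) for s2 in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(old - v[s]))
        if delta < theta:                            # largest change in the sweep is small
            return v

# Hypothetical two-state, one-action example, for illustration only.
P = {0: {'a': {0: 0.5, 1: 0.5}}, 1: {'a': {1: 1.0}}}
R = {0: {'a': {0: 1.0, 1: 0.0}}, 1: {'a': {1: 0.0}}}
v_star = value_iteration({0: 0.0, 1: 0.0}, P, R, gamma=0.9, theta=1e-6)

The temporary variable old keeps the previous value of v(s) only to measure the largest change Δ within a sweep, which drives the stopping test Δ < θ.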
Concerning the solution of the Bellman equation (3.15), policy iteration may be algebraically interpreted as an additive approach and value iteration as a multiplicative one.
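One way to make this remark concrete (a sketch in standard matrix-operator notation, which is an assumption here and is not introduced in this excerpt): exact policy evaluation solves the linear Bellman system, whose solution can be expanded as an additive (Neumann) series,

v^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi} = \sum_{k=0}^{\infty} (\gamma P^{\pi})^{k} r^{\pi},

whereas value iteration obtains v^{*} as the limit of repeatedly applied (i.e. multiplied) Bellman optimality operators T,

v_{k+1} = T v_{k}, \qquad v^{*} = \lim_{k \to \infty} T^{k} v_{0}.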
Finally, we formally write down the Asynchronous DP algorithm. Since it exists in different versions, we consider the one that we use in our tests in Chap. 5. This version is a fully adaptive one, in which the transition probabilities and rewards are also estimated incrementally. Since it is an in-place method, it formally corresponds rather to a value iteration.
Algorithm 3.3: Adaptive ADP
Input: online rewards r and transitions s', discount rate γ, small inner bound θ ∈ ℝ, θ > 0
Output: optimal state-value function v*
1: initialize v(s), π(s), p^a_{ss'}, r^a_{ss'} ∀ s, s' ∈ S, a ∈ A(s) arbitrarily
2: repeat for each episode
(continued)
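Although the full listing of Algorithm 3.3 continues beyond this excerpt, the idea of estimating the model online can be illustrated. The following Python sketch shows one way to maintain incremental estimates of the transition probabilities and rewards from observed transitions and to apply an in-place backup; the class name, data structures, and update rules are assumptions made for this sketch and do not reproduce the book's Algorithm 3.3.

from collections import defaultdict

class AdaptiveModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # visit counts n(s, a, s')
        self.rewards = defaultdict(float)                      # running mean reward r(s, a, s')

    def observe(self, s, a, s2, r):
        """Update the empirical model with one online transition (s, a, r, s')."""
        self.counts[(s, a)][s2] += 1
        n = self.counts[(s, a)][s2]
        # incremental mean of the reward observed for (s, a, s')
        self.rewards[(s, a, s2)] += (r - self.rewards[(s, a, s2)]) / n

    def backup(self, v, s, actions, gamma):
        """In-place Bellman optimality backup for state s under the current model."""
        best = float('-inf')
        for a in actions:
            total = sum(self.counts[(s, a)].values())
            if total == 0:
                continue                              # no data for this action yet
            q = sum(n / total * (self.rewards[(s, a, s2)] + gamma * v.get(s2, 0.0))
                    for s2, n in self.counts[(s, a)].items())
            best = max(best, q)
        if best > float('-inf'):
            v[s] = best                               # in-place update, no second value array
        return v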