10.4.2 Model-Free Computation in Virtue of TD(λ) with Function Approximation
We shall be interested in a model-free stochastic iteration scheme for computing an
approximation of the form (10.9) which does not rely on a factored representation
of the transition probabilities. Though of less practical significance, we shall first
attend to discounted problems without terminal state for the sake of simplicity.
Recall the function space framework (6.1). It is possible to consider the factorization
of the state-value function (10.9) as a linear function approximation in terms
of the basis
\[
\phi_{\alpha\beta} : S \to \mathbb{R}, \qquad
\phi_{\alpha\beta}(s) := u_{s\beta}\,\delta_{\alpha s}, \qquad
\alpha \in S,\ \beta \in \{1,\dots,m\}.
\]
In this view, the coefficient corresponding to $\phi_{\alpha\beta}$ is given by
$\theta_{\alpha\beta}$.
Extensions of the temporal-difference learning algorithms presented in Chap. 3
working with an approximate representation in terms of a linear architecture
\[
v \approx \Phi\theta = \sum_{j} \phi_j \theta_j
\]
are presented and discussed in [BT96, TVR97]. Further generalizations incorporating
the multigrid framework introduced in Chap. 6 are extensively studied in
[Pap11, Ziv04, ZS05]. For the sake of simplicity, we state the model-free update
rule corresponding to (10.2) in terms of the state-value rather than the
action-value function of the given policy:
\[
\theta := \theta + \alpha\,\Phi^{T} z\, d(\Phi\theta),
\tag{10.11}
\]
where $d(\cdot)$ and $z$ are as stipulated in (10.3).
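To make the update rule (10.11) concrete, the following Python sketch simulates it along a single trajectory of a small Markov chain. Since (10.3) is not restated here, the sketch assumes the usual forms: the temporal difference d = r(s) + γ v(s') − v(s), evaluated at v = Φθ, and an accumulating eligibility trace z := γλ z + e_s over the states. The function name `td_lambda_linear`, the three-state chain, the rewards, the feature matrix, and the constant step size are all illustrative assumptions and do not stem from the text.

```python
import numpy as np

def td_lambda_linear(P, r, Phi, gamma, lam, alpha, n_steps, seed=0):
    """Simulate theta := theta + alpha * Phi^T z d(Phi theta) along one trajectory.

    Assumed forms (not taken from the source):
      d = r[s] + gamma * v[s_next] - v[s]   with v = Phi @ theta,
      z := gamma * lam * z + e_s            (accumulating trace over states).
    """
    rng = np.random.default_rng(seed)
    n_states, n_features = Phi.shape
    theta = np.zeros(n_features)
    z = np.zeros(n_states)          # eligibility trace, one entry per state
    s = 0                           # arbitrary initial state
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])   # sample the next state
        z *= gamma * lam                        # decay the trace
        z[s] += 1.0                             # mark the current state
        v = Phi @ theta                         # current value estimate
        d = r[s] + gamma * v[s_next] - v[s]     # scalar temporal difference
        theta += alpha * (Phi.T @ z) * d        # update rule (10.11)
        s = s_next
    return theta

# Hypothetical three-state chain with two basis functions.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
r = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

theta = td_lambda_linear(P, r, Phi, gamma=0.9, lam=0.5, alpha=0.01, n_steps=20000)
print(theta)
```

Note that only the product Φᵀz enters the update, so an implementation may equally well maintain the trace directly in the coefficient space. With a constant step size, as used here for brevity, the iterates can be expected to fluctuate around their limit rather than to settle; convergence results of this kind typically presuppose suitably diminishing step sizes.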
A mathematically inclined reader may be interested in the following convergence
result:
Theorem 10.1 [BT96, TVR97] Let $\Phi$ have linearly independent columns and
$\lambda \in [0,1]$. Furthermore, let
$A_\lambda := (I - \gamma\lambda P)^{-1}(I - \gamma P)$. Then, under the same
assumptions as for ordinary TD($\lambda$), the sequence of iterates generated by the update
rule (10.11) converges a.s. to
\[
\theta_\lambda := \left(\Phi^{T} D A_\lambda \Phi\right)^{-1} \Phi^{T} D A_\lambda A_0^{-1} b.
\]
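Moreover, it holds that
$\theta_1 \in \operatorname{argmin}_\theta \|v - \Phi\theta\|_D$, and
\[
\|\Phi\theta_\lambda - v\|_D \le
\frac{1 - \lambda\gamma}{\sqrt{(1 - \gamma)(1 + \gamma - 2\gamma\lambda)}}\,
\|\Phi\theta_1 - v\|_D,
\]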
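where $\|\cdot\|_D$ denotes the norm corresponding to the inner product induced by the
multiplication operator of the steady-state probabilities of the Markov chain, that is,
$\langle x, y \rangle_D = x^{T} D y$ with $D$ the diagonal matrix of these probabilities.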
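For a small chain with known transition matrix, the limit in Theorem 10.1 and the error bound may be checked numerically. The sketch below does so under the assumption that b denotes the vector of expected one-step rewards, so that v = A_0^{-1} b is the exact state-value function of the discounted problem. The helper names, the three-state chain, and the feature matrix are hypothetical and serve illustration only.

```python
import numpy as np

def stationary_distribution(P):
    """Steady-state probabilities: left eigenvector of P for eigenvalue 1."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()            # normalization also fixes the sign

def theta_limit(P, b, Phi, gamma, lam):
    """Closed-form limit theta_lambda of Theorem 10.1, plus v and D."""
    I = np.eye(P.shape[0])
    D = np.diag(stationary_distribution(P))
    A0 = I - gamma * P
    A_lam = np.linalg.solve(I - gamma * lam * P, A0)   # (I - g*l*P)^{-1} (I - g*P)
    v = np.linalg.solve(A0, b)                         # assumed: v = A_0^{-1} b
    M = Phi.T @ D @ A_lam
    theta = np.linalg.solve(M @ Phi, M @ v)
    return theta, v, D

def d_norm(x, D):
    """Norm induced by the inner product <x, y>_D = x^T D y."""
    return float(np.sqrt(x @ D @ x))

# Hypothetical three-state chain with two basis functions.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
b = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
gamma = 0.9

theta_1, v, D = theta_limit(P, b, Phi, gamma, lam=1.0)
best_error = d_norm(Phi @ theta_1 - v, D)              # error of the D-projection

for lam in (0.0, 0.5, 1.0):
    theta_lam, _, _ = theta_limit(P, b, Phi, gamma, lam)
    error = d_norm(Phi @ theta_lam - v, D)
    factor = (1 - lam * gamma) / np.sqrt((1 - gamma) * (1 + gamma - 2 * gamma * lam))
    print(lam, error, factor * best_error)
```

For λ = 1 the factor equals one, in accordance with the statement that θ_1 minimizes the weighted approximation error; for λ = 0 it reduces to 1/√(1 − γ²).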