The Big Picture: Toward a Synthesis of RL and Adaptive Tensor Factorization - Realtime Data Mining

10.4.2 Model-Free Computation in Virtue of TD(λ) with Function Approximation

We shall be interested in a model-free stochastic iteration scheme for computing an approximation of the form (10.9) which does not rely on a factored representation of the transition probabilities. Though of less practical significance, we shall first attend to discounted problems without terminal state for the sake of simplicity.

Recall the function space framework (6.1). It is possible to consider the factorization of the state-value function (10.9) as a linear function approximation in terms of the basis

$$\phi_{\alpha\beta} : S \to \mathbb{R}, \qquad \phi_{\alpha\beta}(s) := u_{s\beta}\,\delta_{\alpha s},$$

indexed by pairs $(\alpha, \beta)$ with $\beta \in m$. In this view, the coefficient corresponding to $\phi_{\alpha\beta}$ is given by $\theta_{\alpha\beta}$.

Extensions of the temporal-difference learning algorithms presented in Chap. 3 working with an approximate representation in terms of a linear architecture

$$v \approx \Phi\theta = \sum_j \phi_j \theta_j$$

are presented and discussed in [BT96, TVR97]. Further generalizations incorporating the multigrid framework introduced in Chap. 6 are extensively studied in [Pap11, Ziv04, ZS05]. A model-free update rule corresponding to (10.2), which we state in terms of the state-value rather than the action-value function of the given policy for the sake of simplicity, is given by

$$\theta := \theta + \alpha\,\Phi^T z\, d(\Phi\theta), \tag{10.11}$$

where $d(\cdot)$, $z$ are as stipulated in (10.3).
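To make the update rule (10.11) concrete, the following is a minimal sketch of a sample-based TD(λ) iteration with a linear architecture. It rests on assumptions not stated in this section: d is taken to be the usual one-step temporal difference, z the accumulating eligibility trace over states (so that the feature-space vector `trace` plays the role of Φᵀz), and the step size is held constant. The function name `td_lambda_linear` and the three-state toy chain are hypothetical and serve illustration only.

```python
import numpy as np

def td_lambda_linear(P, r, Phi, gamma, lam, alpha, n_steps, seed=0):
    """Sample-based TD(lambda) for the linear architecture v ~ Phi @ theta.

    Assumption: P and r are used only to simulate transitions and rewards;
    the update itself sees nothing but sampled (s, r[s], s_next) triples.
    """
    rng = np.random.default_rng(seed)
    n_states, n_features = Phi.shape
    theta = np.zeros(n_features)
    trace = np.zeros(n_features)  # feature-space counterpart of Phi^T z
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        # temporal difference evaluated at the current approximation Phi @ theta
        d = r[s] + gamma * (Phi[s_next] @ theta) - Phi[s] @ theta
        # accumulating eligibility trace, decayed by gamma * lambda
        trace = gamma * lam * trace + Phi[s]
        theta = theta + alpha * d * trace
        s = s_next
    return theta

# Hypothetical toy problem: an ergodic three-state chain with state rewards.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# With Phi = I the architecture is tabular, so Phi @ theta should approach v.
Phi = np.eye(3)
theta = td_lambda_linear(P, r, Phi, gamma, lam=0.5, alpha=0.05, n_steps=20000)
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)
rel_error = np.linalg.norm(Phi @ theta - v_true) / np.linalg.norm(v_true)
```

Because the step size is constant, the iterates only settle in a neighborhood of the limit; a diminishing step size would be needed for the almost sure convergence that is usually claimed for TD(λ).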

A mathematically inclined reader may be interested in the following convergence

result:

Theorem 10.1 [BT96, TVR97] Let $\Phi$ have linearly independent columns and $\lambda \in [0,1]$. Furthermore, let $A_\lambda := (I - \gamma\lambda P)^{-1}(I - \gamma P)$. Then, under the same assumptions as for ordinary TD($\lambda$), the sequence of iterates generated by the update rule (10.11) converges a.s. to

$$\theta_\lambda := \left(\Phi^T D A_\lambda \Phi\right)^{-1} \Phi^T D A_\lambda v.$$

Moreover, it holds that $\theta_1 = \operatorname{argmin}_\theta \|v - \Phi\theta\|_D$, and

$$\|\Phi\theta_\lambda - v\|_D \le \frac{1 - \lambda\gamma}{\sqrt{(1 - \gamma)(1 + \gamma - 2\gamma\lambda)}}\,\|\Phi\theta_1 - v\|_D,$$

where $\|\cdot\|_D$ denotes the norm corresponding to the inner product induced by the multiplication operator of the steady-state probabilities of the Markov chain.
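The statement of Theorem 10.1 can be checked numerically on a small example. The sketch below assumes that D is the diagonal matrix of steady-state probabilities, that A_λ = (I − γλP)⁻¹(I − γP), that the limit is θ_λ = (ΦᵀDA_λΦ)⁻¹ΦᵀDA_λv, and that the constant in the error bound is (1 − λγ)/√((1 − γ)(1 + γ − 2γλ)). The four-state chain, the rewards, the features and all function names are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Solve d^T P = d^T together with sum(d) = 1 in the least-squares sense."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def td_fixed_point(P, r, Phi, gamma, lam):
    """Return theta_lambda, the exact value function v, and the weight matrix D."""
    n = P.shape[0]
    I = np.eye(n)
    D = np.diag(stationary_distribution(P))
    v = np.linalg.solve(I - gamma * P, r)
    # A_lambda = (I - gamma*lambda*P)^{-1} (I - gamma*P)
    A_lam = np.linalg.solve(I - gamma * lam * P, I - gamma * P)
    M = Phi.T @ D @ A_lam
    theta = np.linalg.solve(M @ Phi, M @ v)
    return theta, v, D

def d_norm(x, D):
    """Norm induced by the D-weighted inner product."""
    return float(np.sqrt(x @ D @ x))

# Hypothetical example: strictly positive transition matrix, two features.
P = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.3, 0.1, 0.5, 0.1],
              [0.2, 0.2, 0.1, 0.5],
              [0.4, 0.1, 0.3, 0.2]])
r = np.array([1.0, 0.0, 2.0, -1.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
gamma = 0.9

theta_1, v, D = td_fixed_point(P, r, Phi, gamma, lam=1.0)
best_error = d_norm(Phi @ theta_1 - v, D)

results = []  # tuples (lambda, error of Phi @ theta_lambda, bound from the theorem)
for lam in np.linspace(0.0, 1.0, 6):
    theta_lam, _, _ = td_fixed_point(P, r, Phi, gamma, lam)
    error = d_norm(Phi @ theta_lam - v, D)
    factor = (1 - lam * gamma) / np.sqrt((1 - gamma) * (1 + gamma - 2 * gamma * lam))
    results.append((lam, error, factor * best_error))
```

For λ = 1 the factor equals one, which matches the characterization of θ_1 as the D-weighted least-squares fit; for smaller λ the admissible error grows.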
