Up the Down Staircase: Hierarchical Reinforcement Learning - Realtime Data Mining

Database Reference

In-Depth Information

b ¼ r π

ð 6

:

10 Þ

so that Eq. ( 3.15 ) takes the standard form ( 6.4 ):

Av ¼ b

:

Now Ziv investigated the use of classic iteration methods such as the Richardson

method for the solution of ( 6.4 ) with the operator ( 6.9 ) and the right-hand side

( 6.10 ). Using the eigenvalue properties of the transition probability matrix P π ,it

was established by this means that algebraic smooth error vectors do arise. As was

mentioned at the start, multilevel methods are outstandingly suitable for the solu-

tion of these problems. Of course everything we have described carries over to the

case of action-value functions ( 3.17 ).

Furthermore. in [Ziv04] and [Pap10], the use of multilevel methods is investi-

gated not only for the problems of dynamic programming (i.e., model based) but

also for model-free methods, in particular for temporal-difference learning. For the

latter, two multilevel methods are proposed, the first a multiplicative variant and the

second an additive variant. The multiplicative variant is less convincing, and it

cannot be fully proven to converge. In the following we will investigate the

multigrid method for the model-based case and then move to Ziv's additive

preconditioner for the model-free case.

6.2.1

Interpolation and Restriction Based on State

Aggregation

We consider a hierarchy with the levels 0, ... , L , where 0 is the finest grid which

represents the current state values from S and L the coarsest grid. An interpolator

I lþ1 from level lþ 1 to level l shall be given. For the restriction I lþ l of level lþ 1to

the level l in general, the transpose or pseudo-inverse of I lþ1 is used.

In [Ziv04] and [Pap10] the construction of the interpolator I lþ1 was considered

on an algebraic basis in the course of the algebraic multigrid. In particular the state

aggregation was considered according to [BC89] that we will also apply. For this

purpose the state space S is split into K disjoint groups G β ,

β ¼ 1,

...

, m . The

definition of the interpolator is

I lþ 1 iβ ¼

1,

G β

0, else

i

∈

ð 6 : 11 Þ

and the restrictor is defined as its pseudo-inverse according to Moore-Penrose:

h

i 1

I lþ 1 T I lþ 1

I lþ 1 T

I lþ 1

l

¼

:

ð 6

:

12 Þ

Realtime Data Mining

Search WWH ::

Custom Search

Home