state values at the current and subsequent time steps, respectively. More explicitly, we can write x^(t+1) = π(x^(t)), which indicates that the state x^(t+1) is a function of the ε-greedy action selection policy π, as is used in this work. More generally though, any action selection policy π may be used, in which case we could simply write x^(t+1) = π(x^(t)). An ε-greedy action selection policy chooses the action that results in the next-state x^(t+1) with the greatest value 100ε % of the time (where ε ranges over [0, 1]). In the other 100(1 − ε) % of the time, random actions are chosen regardless of the values of the next-states. Thus, this policy allows the agent to exploit its knowledge 100ε % of the time, but also to explore potentially better actions 100(1 − ε) % of the time.
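As a concrete illustration, the following is a minimal sketch of this selection rule, following the convention above in which ε is the probability of the greedy (exploiting) choice; the candidate next-states and the value_of function that forward-propagates a state through the network are hypothetical placeholders:

```python
import random

def epsilon_greedy(next_states, value_of, epsilon):
    """Select a next-state under the epsilon-greedy convention used here:
    with probability epsilon, exploit by taking the next-state with the
    greatest predicted value; otherwise explore by choosing uniformly at
    random. `next_states` and `value_of` are hypothetical placeholders for
    the states reachable from x(t) and the network's value estimate."""
    if random.random() < epsilon:
        return max(next_states, key=value_of)  # exploit: greedy choice
    return random.choice(next_states)          # explore: random choice
```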
The values of V^(t) and V^(t+1) are determined by evaluating the respective states (x^(t) and x^(t+1)) through the neural network using forward propagation, where the next-state vector x^(t+1) is determined based on an action selection procedure and the dynamics of the domain. This expression for the temporal difference error also discounts the subsequent state value V^(t+1) by a factor γ, which serves to attenuate the value that the network is attempting to learn. Note that the temporal difference algorithm gets its name from this error expression, as it is based on the difference in the predicted state values at two different time steps.
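For concreteness, the temporal difference error described here can be sketched as follows, assuming a hypothetical value_network callable that returns the forward-propagated value of a state vector:

```python
def td_error(value_network, x_t, x_t1, reward_t1, gamma):
    """Temporal difference error r(t+1) + gamma * V(t+1) - V(t), where V(t)
    and V(t+1) are obtained by forward propagation of x(t) and x(t+1)."""
    V_t = value_network(x_t)    # V(t)
    V_t1 = value_network(x_t1)  # V(t+1), with x(t+1) chosen by the action selection policy
    return reward_t1 + gamma * V_t1 - V_t
```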
The general form of the TD(λ) algorithm can be more explicitly written for updating the network weights such that w ← w + Δw. The weight updates between nodes in the output layer j and nodes in the hidden layer h (Δw_jh^(t)) at time step t can be stated as:

$$
\Delta w_{jh}^{(t)} = \alpha \left( r^{(t+1)} + \gamma V^{(t+1)} - V^{(t)} \right) \sum_{k=0}^{t} \lambda^{t-k} \, f'\!\left( v_j^{(k)} \right) y_h^{(k)}
\tag{2.7}
$$
where f'(v_j^(k)) is the derivative of the transfer function at node j evaluated at the induced local field v_j^(k) at time step k. Equation 2.7 can then be extended to updating the weights between nodes in the hidden layer h and nodes in the input layer i (Δw_hi^(t)) at time step t as (again with a single output node):

$$
\Delta w_{hi}^{(t)} = \alpha \left( r^{(t+1)} + \gamma V^{(t+1)} - V^{(t)} \right) \sum_{k=0}^{t} \lambda^{t-k} \, f'\!\left( v_j^{(k)} \right) w_{jh}^{(t)} \, f'\!\left( v_h^{(k)} \right) x_i^{(k)}
\tag{2.8}
$$
A basic implementation of the TD(λ) algorithm requires only the use of Eqs. 2.7 and 2.8. Extending these equations using some relatively simple techniques, however, can significantly reduce the computational cost in terms of both time and space, improving the efficiency of the learning algorithm.
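As a rough sketch of such an implementation, the Python class below applies Eqs. 2.7 and 2.8 for a network with a single tanh hidden layer and a single output node. Rather than storing the full history of past gradients, the λ-weighted sums over k are accumulated incrementally as running traces (e_jh, e_hi), which is algebraically equivalent; the class name, layer sizes, transfer function, and parameter values are illustrative assumptions, not the specific implementation used in this work:

```python
import numpy as np

class TDLambdaNet:
    """Sketch of a single-hidden-layer value network updated per Eqs. 2.7/2.8."""

    def __init__(self, n_inputs, n_hidden, alpha=0.01, gamma=0.9, lam=0.7):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.w_hi = np.random.uniform(-0.1, 0.1, (n_hidden, n_inputs))  # input -> hidden
        self.w_jh = np.random.uniform(-0.1, 0.1, n_hidden)              # hidden -> output
        self.e_hi = np.zeros_like(self.w_hi)  # running lambda-weighted sum for Eq. 2.8
        self.e_jh = np.zeros_like(self.w_jh)  # running lambda-weighted sum for Eq. 2.7

    def forward(self, x):
        v_h = self.w_hi @ x    # induced local fields at hidden nodes h
        y_h = np.tanh(v_h)     # hidden activations y_h
        v_j = self.w_jh @ y_h  # induced local field at output node j
        return np.tanh(v_j), v_j, y_h, v_h

    def update(self, x_t, reward_t1, V_t1):
        """One TD(lambda) step: accumulate the gradient terms at x(t), then
        apply the weight changes scaled by r(t+1) + gamma*V(t+1) - V(t)."""
        V_t, v_j, y_h, v_h = self.forward(x_t)
        dphi = lambda v: 1.0 - np.tanh(v) ** 2  # derivative of the tanh transfer function
        self.e_jh = self.lam * self.e_jh + dphi(v_j) * y_h                                    # Eq. 2.7 terms
        self.e_hi = self.lam * self.e_hi + np.outer(dphi(v_j) * self.w_jh * dphi(v_h), x_t)   # Eq. 2.8 terms
        delta = reward_t1 + self.gamma * V_t1 - V_t  # temporal difference error
        self.w_jh += self.alpha * delta * self.e_jh
        self.w_hi += self.alpha * delta * self.e_hi
```

A training step at time t would first select x(t+1) with the action selection policy, forward-propagate it to obtain V(t+1), and then call update(x_t, reward_t1, V_t1).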
A momentum term with coefficient η can be added to Eqs. 2.7 and 2.8 in order to incorporate a portion of the weight update from the previous time step t − 1 into that for the current time step t. This has the effect of smoothing out the network weight changes between time steps and is often most effective when training the network in a batch manner. Batch training is where weight updates are computed during every time step, but the updates are only applied after every n time steps, where