n > 1 and could extend across multiple episodes. After adding the momentum term,
Eqs. 2.7 and 2.8 become, respectively:
$$
\Delta w_{jh}(t) = \eta\,\Delta w_{jh}(t-1) + \alpha\left[r(t+1) + \gamma V(t+1) - V(t)\right]\sum_{k=0}^{t}\lambda^{t-k} f'\!\big(v_j(k)\big)\, y_h(k) \qquad (2.9)
$$

$$
\Delta w_{hi}(t) = \eta\,\Delta w_{hi}(t-1) + \alpha\left[r(t+1) + \gamma V(t+1) - V(t)\right]\sum_{k=0}^{t}\lambda^{t-k} f'\!\big(v_j(k)\big)\, w_{jh}(t)\, f'\!\big(v_h(k)\big)\, x_i(k) \qquad (2.10)
$$
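As a concrete illustration, the following Python sketch applies Eqs. 2.9 and 2.10 to a network with a single hidden layer and one output unit. The class name, the sigmoid choice for f, the parameter values, and the incremental (running-sum) handling of the summation over k are assumptions made for this example, not code from the text.

```python
# Minimal sketch of the momentum-augmented TD(lambda) updates in Eqs. 2.9 and 2.10
# for a single-hidden-layer network with one output unit V. All names and defaults
# here are illustrative assumptions.
import numpy as np

def f(v):        # activation function (assumed sigmoid)
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):  # derivative of the activation function
    s = f(v)
    return s * (1.0 - s)

class TDNetwork:
    def __init__(self, n_in, n_hidden, alpha=0.1, gamma=1.0, lam=0.7, eta=0.5):
        self.alpha, self.gamma, self.lam, self.eta = alpha, gamma, lam, eta
        self.w_hi = np.random.uniform(-0.1, 0.1, (n_hidden, n_in))  # input -> hidden
        self.w_jh = np.random.uniform(-0.1, 0.1, n_hidden)          # hidden -> output
        # Running sums over k of lambda^(t-k) * gradient terms (the summations in 2.9, 2.10)
        self.e_jh = np.zeros_like(self.w_jh)
        self.e_hi = np.zeros_like(self.w_hi)
        # Previous weight changes, used by the momentum terms
        self.d_jh_prev = np.zeros_like(self.w_jh)
        self.d_hi_prev = np.zeros_like(self.w_hi)

    def value(self, x):
        v_h = self.w_hi @ x      # net input to hidden units
        y_h = f(v_h)             # hidden activations
        v_j = self.w_jh @ y_h    # net input to the output unit
        return f(v_j), v_h, y_h, v_j

    def update(self, x_t, r_next, v_next):
        """One weight update after observing state x(t), reward r(t+1), and V(t+1)."""
        V_t, v_h, y_h, v_j = self.value(x_t)
        delta = r_next + self.gamma * v_next - V_t   # temporal difference error

        # Decay old gradient information by lambda and add the gradient at x(t)
        self.e_jh = self.lam * self.e_jh + f_prime(v_j) * y_h
        self.e_hi = self.lam * self.e_hi + np.outer(
            f_prime(v_j) * self.w_jh * f_prime(v_h), x_t)

        # Eq. 2.9: hidden-to-output weights, momentum term plus learning term
        d_jh = self.eta * self.d_jh_prev + self.alpha * delta * self.e_jh
        # Eq. 2.10: input-to-hidden weights
        d_hi = self.eta * self.d_hi_prev + self.alpha * delta * self.e_hi

        self.w_jh += d_jh
        self.w_hi += d_hi
        self.d_jh_prev, self.d_hi_prev = d_jh, d_hi
```

In use, update() would be called once per transition with the observed x(t), the reward r(t+1), and the network's own estimate of V(t+1); depending on the problem, the running sums and previous weight changes may be reset at the start of each episode or, as noted above, carried across multiple episodes.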
There are a few important notes about the above two equations. The error term,
the difference between subsequent state values V(t+1) and V(t) (neglecting the reward
term for now), is essentially the information (i.e., feedback) that is used to update
network weights. Drawing from the terminology of the back-propagation algorithm,
V(t+1) can be considered the target output value y_j, and V(t) can be considered the
predicted output value ŷ_j. The network error is therefore based on the next-state value
V(t+1), despite the fact that this value is merely an estimate and may actually be
quite different from the true state value. When weight updates become small, it is
possible that numerical errors could affect the final network weights. However, it is
likely that the error between the approximated value function by the neural network
and the true (unobserved) value function will dominate any numerical error, and thus
numerical error is not a large concern.
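Stated in back-propagation terms, and with the reward term retained rather than neglected, one way to read the bracketed error in Eqs. 2.9 and 2.10 is as a target minus a prediction:

$$
\delta(t) \;=\; \underbrace{r(t+1) + \gamma V(t+1)}_{\text{target } y_j} \;-\; \underbrace{V(t)}_{\text{prediction } \hat{y}_j}
$$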
The general form of a sequential decision-making process proceeds over t =
0, 1, ..., T. For all intermediate (i.e., non-terminal, t ≠ T) time steps, the next-state
values V(t+1) are available based on their predicted value after pursuing an action;
an associated reward r(t+1) may also be provided at these time steps, and
the incorporation of rewards at intermediate time points is dependent on the specific
problem. For example, some problems have well-defined subgoals that must be
achieved en route to an ultimate goal. In such cases, non-zero reward values may be
provided when these subgoals are achieved. In other problems, such as board games,
there are often no such subgoals, and thus the reward at all intermediate time steps
t = 0, 1, ..., T − 1 is 0, and the temporal difference error is then based only on the
difference between subsequent state values (i.e., γV(t+1) − V(t)).
At the terminal time step t = T, there is no next-state value V(t+1), and this value
is set to 0. The reward at this time step, r(t+1), is non-zero, however, and the temporal
difference error is then based on the previous state value and the reward value (i.e.,
r(t+1) − V(t)). For problems in which there are no intermediate rewards, the reward
provided at time step T is the only information that is known with complete certainty,
and thus this is the only true information from which the neural network can learn
the values of all previously visited states.
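A small sketch, assuming an illustrative helper named td_error and a discount factor gamma (neither taken from the text), makes the intermediate and terminal cases above explicit:

```python
# Illustrative helper (not from the book) showing how the temporal difference
# error changes between intermediate and terminal time steps.
def td_error(r_next, v_next, v_t, terminal, gamma=1.0):
    if terminal:
        v_next = 0.0                  # no next-state value at t = T
    return r_next + gamma * v_next - v_t

# Intermediate step of a board-game-like problem: reward is 0, so the error is
# gamma * V(t+1) - V(t).
print(td_error(r_next=0.0, v_next=0.62, v_t=0.55, terminal=False))

# Terminal step t = T: V(t+1) is set to 0, so the error is r(t+1) - V(t).
print(td_error(r_next=1.0, v_next=0.0, v_t=0.80, terminal=True))
```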
Equations 2.9 and 2.10 require that information (terms within the summation)
from all previous states x(0), ..., x(t − 1) be used to determine the appropriate weight
adjustment at state x(t). At time step t, information from all previous states is merely
discounted by λ. This can be exploited to reduce the number of computations required