field that make slight modifications to temporal difference methods or that use a
different or extended conceptual model of learning. The reader is directed to Sutton
and Barto (1998), Szepesvári (2010), Powell and Ma (2011), and Dann et al. (2014)
for a comprehensive review of many algorithms used for reinforcement learning.
2.2.3.1 Policy Evaluation Approaches
The temporal difference learning approach is a general concept used for learning
how to perform in sequential decision-making problems. Here, we describe two
possible approaches to using TD(λ) that are conceptually different in what is
actually being learned.
The simplest use of TD(λ) is exactly what we have previously described. The
neural network takes in a state vector and outputs a single value. That is, the
input to the network is the current state vector x(t), and the output of the
network, V(t), is the value of being in that state. Similarly, the value of being
in state x(t+1) can be evaluated by inputting x(t+1) to the network and obtaining
a value V(t+1). The action selection policy requires every possible next-state to
be evaluated through the network individually, and one action is then selected
from the set of possible actions using, for example, an ε-greedy action selection
procedure. For simple problems where the state transitions are deterministic and
known, and thus the next-states are known, such as in Gridworld, this approach
works well.
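The following is a minimal sketch of this selection scheme, assuming a deterministic Gridworld-style setting. The callables value_network (returning a scalar value for a state vector) and next_state (returning the known successor state for a state-action pair) are hypothetical placeholders, not functions defined in the text.

```python
import random

def epsilon_greedy_from_state_values(state, actions, value_network, next_state,
                                     epsilon=0.1):
    """Select an action by evaluating V(x(t+1)) for each known next-state.

    Assumes deterministic, known transitions (as in Gridworld):
      value_network(x) -> scalar value estimate for state vector x
      next_state(x, a) -> state reached by taking action a in state x
    Both callables are assumed for illustration only.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore with probability epsilon
    # Exploit: evaluate each candidate next-state through the network
    # individually and pick the action leading to the highest value.
    values = [value_network(next_state(state, a)) for a in actions]
    return actions[max(range(len(actions)), key=lambda i: values[i])]
```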
However, in more complex problems, especially control-type problems with
continuous dynamics or some form of randomness in the environment, it is not
always possible to know what the next-states are, even if it is known what actions
can be performed. Consequently, one cannot simply evaluate the value of each action,
because the next-states are unknown. An alternative formulation is to use a neural
network with multiple output nodes. Each output node represents one action j out
of the set of possible actions, and the value at output node j, V(x(t), a_j(t)),
represents the value of taking that specific action when in the state x(t) that is
input to the network. The output of the network can then be considered a
state-action value, or the value of pursuing action j when in state x(t). In a
single pass through the network, one can compute the state-action values for all
actions. The selection of a single action can proceed just as in the case where
the network has a single output node; that is, using an ε-greedy action selection
procedure, perhaps based on the state-action values of all possible actions.
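A minimal sketch of this multi-output formulation is given below. The toy one-hidden-layer network, its sizes, and the tanh activation are illustrative assumptions; only the idea that one forward pass yields a value per action comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: one hidden layer, one output node per action.
n_state, n_hidden, n_actions = 4, 16, 3
W1 = rng.normal(scale=0.1, size=(n_hidden, n_state))
W2 = rng.normal(scale=0.1, size=(n_actions, n_hidden))

def action_values(x):
    """One forward pass returns V(x, a_j) for every action j at once."""
    h = np.tanh(W1 @ x)
    return W2 @ h  # shape (n_actions,)

def select_action(x, epsilon=0.1):
    """Epsilon-greedy over the state-action values of all possible actions."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(action_values(x)))
```

Note that, unlike the single-output formulation, no transition model is needed: the action is chosen directly from the current state's outputs.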
The weight updates for state-action value learning are slightly different. Rather
than having a single error term (for a single network output), we have an error vector
where each value corresponds to one of the possible actions. Furthermore, only the
output node j corresponding to the action that was selected has a non-zero error
term, which is:
\[
E(t) = r(t+1) + \gamma\, V\bigl(x(t+1), a(t+1)\bigr) - V\bigl(x(t), a(t)\bigr)
\]
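As a concrete sketch of building this error vector, the function below zeroes every entry except the output node of the selected action; the function name, arguments, and the discount value are illustrative assumptions.

```python
import numpy as np

def td_error_vector(r_next, v_next_sa, v_curr_all, action, gamma=0.9):
    """Error vector for state-action value learning.

    Only the output node for the action actually taken gets a non-zero
    TD error; all other entries stay zero.
      r_next     : reward r(t+1) observed after taking `action`
      v_next_sa  : V(x(t+1), a(t+1)), value of the next state-action pair
      v_curr_all : network outputs V(x(t), a_j) for all actions j at time t
      action     : index j of the action selected at time t
    """
    errors = np.zeros_like(v_curr_all)
    errors[action] = r_next + gamma * v_next_sa - v_curr_all[action]
    return errors

# Example: three actions, action 1 was taken.
e = td_error_vector(r_next=1.0, v_next_sa=0.5,
                    v_curr_all=np.array([0.2, 0.4, 0.1]), action=1)
# e == [0.0, 1.05, 0.0]
```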