field that make slight modifications to temporal difference methods or that use a
different or extended conceptual model of learning. The reader is directed to Sutton
and Barto (1998), Szepesvári (2010), Powell and Ma (2011), and Dann et al. (2014)
for a comprehensive review of many algorithms used for reinforcement learning.
2.2.3.1 Policy Evaluation Approaches
The temporal difference learning approach is a general concept used for learning
how to perform in sequential decision-making problems. Here, we describe two
possible approaches to using TD(λ) that are conceptually different in what is
actually being learned.
The simplest use of TD(λ) is exactly what we have previously described. The
neural network takes in a state vector and outputs a single value. That is, the
input to the network is the current state vector x(t), and the output of the
network, V(t), is the value of being in that state. Similarly, the value of being
in state x(t+1) can be evaluated by inputting x(t+1) to the network and obtaining
a value V(t+1). The action selection policy requires every possible next-state to
be evaluated through the network individually, and one action is then selected
from the set of possible actions using, for example, an ε-greedy action selection
procedure. For simple problems where the state transitions are deterministic and
known, and thus the next-states are known, such as in Gridworld, this approach
works well.
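The following is a minimal sketch of this selection scheme, assuming a deterministic Gridworld-style setting. The callables value_network (returning a scalar value for a state vector) and next_state (returning the known successor state for a state-action pair) are hypothetical placeholders, not functions defined in the text.

```python
import random

def epsilon_greedy_from_state_values(state, actions, value_network, next_state,
                                     epsilon=0.1):
    """Select an action by evaluating V(x(t+1)) for each known next-state.

    Assumes deterministic, known transitions (as in Gridworld):
      value_network(x) -> scalar value estimate for state vector x
      next_state(x, a) -> state reached by taking action a in state x
    Both callables are assumed for illustration only.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore with probability epsilon
    # Exploit: evaluate each candidate next-state through the network
    # individually and pick the action leading to the highest value.
    values = [value_network(next_state(state, a)) for a in actions]
    return actions[max(range(len(actions)), key=lambda i: values[i])]
```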
However, in more complex problems, especially control-type problems with
continuous dynamics or some form of randomness in the environment, it is not
always possible to know what the next-states are, even if it is known what actions
can be performed. Consequently, one cannot simply evaluate the value of each action,
because the next-states are unknown. An alternative formulation is to use a neural
network with multiple output nodes. Each output node represents one action j out
of the set of possible actions, and the value at output node j, V(x(t), a_j(t)),
represents the value of taking that specific action when in the state x(t) that is
input to the network. The output of the network can then be considered a
state-action value, or the value of pursuing action j when in state x(t). In a
single pass through the network, one can compute the state-action values for all
actions. The selection of a single action can proceed just as in the case where
the network has a single output node; that is, using an ε-greedy action selection
procedure, perhaps based on the state-action values of all possible actions.
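A minimal sketch of this multi-output formulation is given below. The toy one-hidden-layer network, its sizes, and the tanh activation are illustrative assumptions; only the idea that one forward pass yields a value per action comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: one hidden layer, one output node per action.
n_state, n_hidden, n_actions = 4, 16, 3
W1 = rng.normal(scale=0.1, size=(n_hidden, n_state))
W2 = rng.normal(scale=0.1, size=(n_actions, n_hidden))

def action_values(x):
    """One forward pass returns V(x, a_j) for every action j at once."""
    h = np.tanh(W1 @ x)
    return W2 @ h  # shape (n_actions,)

def select_action(x, epsilon=0.1):
    """Epsilon-greedy over the state-action values of all possible actions."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(action_values(x)))
```

Note that, unlike the single-output formulation, no transition model is needed: the action is chosen directly from the current state's outputs.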
The weight updates for state-action value learning are slightly different. Rather
than having a single error term (for a single network output), we have an error vector
where each value corresponds to one of the possible actions. Furthermore, only the
output node j corresponding to the action that was selected has a non-zero error
term, which is:
\[
E(t) = r(t+1) + \gamma\, V\bigl(x(t+1), a(t+1)\bigr) - V\bigl(x(t), a(t)\bigr)
\]
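As a concrete sketch of building this error vector, the function below zeroes every entry except the output node of the selected action; the function name, arguments, and the discount value are illustrative assumptions.

```python
import numpy as np

def td_error_vector(r_next, v_next_sa, v_curr_all, action, gamma=0.9):
    """Error vector for state-action value learning.

    Only the output node for the action actually taken gets a non-zero
    TD error; all other entries stay zero.
      r_next     : reward r(t+1) observed after taking `action`
      v_next_sa  : V(x(t+1), a(t+1)), value of the next state-action pair
      v_curr_all : network outputs V(x(t), a_j) for all actions j at time t
      action     : index j of the action selected at time t
    """
    errors = np.zeros_like(v_curr_all)
    errors[action] = r_next + gamma * v_next_sa - v_curr_all[action]
    return errors

# Example: three actions, action 1 was taken.
e = td_error_vector(r_next=1.0, v_next_sa=0.5,
                    v_curr_all=np.array([0.2, 0.4, 0.1]), action=1)
# e == [0.0, 1.05, 0.0]
```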