denoted by α. In that framework, the algorithm described below is the most commonly used. The method can also be used in finite-horizon problems or in stochastic shortest-path problems.
5.4.2 TD Algorithm of Policy Evaluation
5.4.2.1 TD(1) Algorithm and Temporal Difference Definition
The “temporal difference” method (TD method in abbreviated form) is based on the following cost additivity equation,
$$J_\pi(x) = E_{P_\pi,x}\bigl[\,c(x,\pi(x),X_1) + \alpha J_\pi(X_1)\,\bigr],$$
which was stated in the previous section, and which is rewritten here in a simpler form without superscripts.
When we perform a transition ($x \to y$) under the admissible state-action couple $(x,\pi(x))$, the corresponding cost $c(x,\pi(x),y)$ must be used to update the estimate $\tilde{J}_\pi(x)$ of $J_\pi(x)$. That update is performed by the recursive computation of the average, using a filtering technique with gain (or learning rate) $\gamma$. Thus, we combine the new information about the average total cost, $c(x,\pi(x),y) + \alpha \tilde{J}_\pi(y)$, with the previous information $\tilde{J}_\pi(x)$ according to the following relation:
$$\tilde{J}_\pi^{+}(x) = \tilde{J}_\pi(x) + \gamma\,\bigl[\,c(x,\pi(x),y) + \alpha \tilde{J}_\pi(y) - \tilde{J}_\pi(x)\,\bigr].$$
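To make this update concrete, here is a minimal Python sketch of a single filtered update, assuming the estimates $\tilde{J}_\pi$ are stored in a dictionary J and that transition and cost are user-supplied functions (hypothetical names) that simulate the chain and return the cost $c(x,a,y)$:

    def td_update(J, x, pi, transition, cost, alpha, gamma):
        """One filtered update of the estimate J[x] after simulating a
        single transition x -> y under the policy pi."""
        a = pi[x]                                  # action prescribed by the policy
        y = transition(x, a)                       # simulate the transition x -> y
        target = cost(x, a, y) + alpha * J[y]      # new information on the cost-to-go
        J[x] = J[x] + gamma * (target - J[x])      # recursive average with gain gamma
        return y                                   # next state, so the simulation can continue

Repeated calls to this function along a simulated trajectory perform the policy evaluation described in the text.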
The properties of decreasing-gain and constant-gain filtering techniques were reviewed in Chap. 4. If the gain decreases with the number of updates (typically as its inverse), the filter converges (slowly) to the desired value (consistent estimation). If a small, constant gain is used in the stationary regime, the filter undergoes small fluctuations around the desired value; however, it is able (subject to an appropriate tuning of the gain) to track slow variations of the environment. In practice, one generally implements first a decreasing gain to get close to the stationary regime, and then a small constant-gain filter to track it. Note that the gain tuning may here be specific to each state.
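As an illustration of such a per-state gain schedule, the following sketch switches from a decreasing gain to a small constant gain; the constants gamma0 and gamma_min are illustrative choices, not prescribed by the text.

    from collections import defaultdict

    update_counts = defaultdict(int)   # number of updates already performed for each state

    def gain(n, gamma0=1.0, gamma_min=0.01):
        """Gain for the n-th update of a given state: decreasing as gamma0 / n
        at first (consistent estimation), then frozen at the small constant
        gamma_min so the filter can track slow variations of the environment."""
        return max(gamma0 / n, gamma_min)

    # Per-state usage inside the TD update:
    #   update_counts[x] += 1
    #   J[x] += gain(update_counts[x]) * (target - J[x])

The decreasing phase brings the estimate close to the stationary regime; the small constant phase keeps the filter responsive afterwards.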
Yet the updates of the values of $\tilde{J}_\pi$ for different states are coupled: the update of $\tilde{J}_\pi$ after a transition ($x \to y$) uses the previous update of $\tilde{J}_\pi(y)$. That method is called a “temporal difference method” (TD method), and it extends to a trajectory of length $N$ as follows.
Given a policy $\pi$, a current estimate $\tilde{J}_\pi$ of $J_\pi$, and a state trajectory $(x_0,\ldots,x_N)$ with initial state $x_0$ and $N$ transitions, obtained by applying the policy, the temporal difference of order $k$ is the quantity $d_k$ defined by
$$d_k = c(x_k,\pi(x_k),x_{k+1}) + \alpha \tilde{J}_\pi(x_{k+1}) - \tilde{J}_\pi(x_k).$$
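A minimal sketch of this definition, computing the temporal differences $d_0,\ldots,d_{N-1}$ along a given trajectory (the argument names are illustrative):

    def temporal_differences(traj, J, pi, cost, alpha):
        """Temporal differences d_k along a trajectory (x_0, ..., x_N)
        generated by the policy pi, using the current estimate J."""
        d = []
        for k in range(len(traj) - 1):
            x_k, x_next = traj[k], traj[k + 1]
            d.append(cost(x_k, pi[x_k], x_next) + alpha * J[x_next] - J[x_k])
        return d

Each d_k measures how much the current estimate is contradicted by the k-th observed transition.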