denoted by α. In that framework, the algorithm described below is the most commonly used. The method can also be used in finite-horizon problems or in stochastic shortest-path problems.
5.4.2 TD Algorithm of Policy Evaluation
5.4.2.1 TD(1) Algorithm and Temporal Difference Definition
The “temporal difference” method (TD method in abbreviated form) is based on the following cost additivity equation,
$$J_\pi(x) = E_{P_\pi,x}\bigl[\,c(x,\pi(x),X_1) + \alpha J_\pi(X_1)\,\bigr],$$
which was stated in the previous section, and which is rewritten here in a simpler form without superscripts.
When we perform a transition ($x \to y$) under the admissible state-action couple $(x,\pi(x))$, the corresponding cost $c(x,\pi(x),y)$ must be used to update the estimate $\tilde{J}_\pi(x)$ of $J_\pi(x)$. That update is performed by the recursive computation of the average, using a filtering technique with gain (or learning rate) $\gamma$. Thus, we combine the new information about the average total cost, $c(x,\pi(x),y) + \alpha \tilde{J}_\pi(y)$, with the previous information $\tilde{J}_\pi(x)$ according to the following relation:
$$\tilde{J}_\pi^{+}(x) = \tilde{J}_\pi(x) + \gamma\,\bigl[\,c(x,\pi(x),y) + \alpha \tilde{J}_\pi(y) - \tilde{J}_\pi(x)\,\bigr].$$
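To make this update concrete, here is a minimal Python sketch of a single filtered update, assuming the estimates $\tilde{J}_\pi$ are stored in a dictionary J and that transition and cost are user-supplied functions (hypothetical names) that simulate the chain and return the cost $c(x,a,y)$:

    def td_update(J, x, pi, transition, cost, alpha, gamma):
        """One filtered update of the estimate J[x] after simulating a
        single transition x -> y under the policy pi."""
        a = pi[x]                                  # action prescribed by the policy
        y = transition(x, a)                       # simulate the transition x -> y
        target = cost(x, a, y) + alpha * J[y]      # new information on the cost-to-go
        J[x] = J[x] + gamma * (target - J[x])      # recursive average with gain gamma
        return y                                   # next state, so the simulation can continue

Repeated calls to this function along a simulated trajectory perform the policy evaluation described in the text.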
The properties of decreasing-gain and constant-gain filtering techniques were reviewed in Chap. 4. If the gain decreases with the number of updates (typically as its inverse), the filter converges (slowly) to the desired value (consistent estimation). If a small, constant gain is used in the stationary regime, the filter undergoes small fluctuations around the desired value; however, it is able (subject to an appropriate tuning of the gain) to track slow variations of the environment. In practice, one generally implements first a decreasing gain to get close to the stationary regime, and then a small constant-gain filter to track it. Note that the gain tuning may here be specific to each state.
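As an illustration of such a per-state gain schedule, the following sketch switches from a decreasing gain to a small constant gain; the constants gamma0 and gamma_min are illustrative choices, not prescribed by the text.

    from collections import defaultdict

    update_counts = defaultdict(int)   # number of updates already performed for each state

    def gain(n, gamma0=1.0, gamma_min=0.01):
        """Gain for the n-th update of a given state: decreasing as gamma0 / n
        at first (consistent estimation), then frozen at the small constant
        gamma_min so the filter can track slow variations of the environment."""
        return max(gamma0 / n, gamma_min)

    # Per-state usage inside the TD update:
    #   update_counts[x] += 1
    #   J[x] += gain(update_counts[x]) * (target - J[x])

The decreasing phase brings the estimate close to the stationary regime; the small constant phase keeps the filter responsive afterwards.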
Yet the updates of the values of $\tilde{J}_\pi$ for different states are coupled: the update of $\tilde{J}_\pi$ after a transition ($x \to y$) uses the previous update of $\tilde{J}_\pi(y)$. That method is called a “temporal difference method” (TD method), and it extends to a trajectory of length $N$ as follows.
Given a policy $\pi$, a current estimate $\tilde{J}_\pi$ of $J_\pi$, and a state trajectory $(x_0,\ldots,x_N)$ with initial state $x_0$ and $N$ transitions, obtained by applying the policy, the temporal difference of order $k$ is the quantity $d_k$ defined by
$$d_k = c(x_k,\pi(x_k),x_{k+1}) + \alpha \tilde{J}_\pi(x_{k+1}) - \tilde{J}_\pi(x_k).$$
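A minimal sketch of this definition, computing the temporal differences $d_0,\ldots,d_{N-1}$ along a given trajectory (the argument names are illustrative):

    def temporal_differences(traj, J, pi, cost, alpha):
        """Temporal differences d_k along a trajectory (x_0, ..., x_N)
        generated by the policy pi, using the current estimate J."""
        d = []
        for k in range(len(traj) - 1):
            x_k, x_next = traj[k], traj[k + 1]
            d.append(cost(x_k, pi[x_k], x_next) + alpha * J[x_next] - J[x_k])
        return d

Each d_k measures how much the current estimate is contradicted by the k-th observed transition.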