Various algorithms have been suggested for applying the temporal-difference methodology to problems in such areas as games, optimal planning, and combinatorial optimization. Convergence of those algorithms has been proved, and the following general framework of eligibility traces was first described in [Bertsekas et al. 1996].
In that framework, $k$ is an integer that labels the steps of the algorithm. At step $k$, an initial state $x_0$ is chosen. That choice depends on the past history of the algorithm implementation, and guarantees that each state is visited frequently enough. Then, the current policy is applied for $N$ instants, and a trajectory $\omega^k = (x_0, x_1, \ldots, x_m^k, \ldots, x_N)$ of length $N$ is generated. The associated costs are observed and the temporal differences $d_m^k$ are computed.
Then a finite sequence of positive state functions $z_m^k$ is chosen. That sequence is called an eligibility trace; it has the following properties (index $m$ labels time along the trajectory; $\delta$ is the Kronecker delta function):

$z_0^k(x) = \delta_{x_0}(x)$, and, in addition, $z_m^k(x) = 1$ when $m$ is the first visiting time of state $x$ in trajectory $\omega^k$;

$z_{m+1}^k(x) \le z_m^k(x) + \delta_{x_{m+1}^k}(x)$.
In addition, let us consider a decreasing sequence $\{\gamma_k\}$ of state functions that take their values in $[0, 1]$. That sequence is the sequence of gains, or learning rates, and it obeys the classical assumptions of stochastic approximation theory:

$$\sum_k \gamma_k(x) = \infty, \qquad \sum_k \gamma_k(x)^2 < \infty.$$
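As a concrete illustration (a standard choice in stochastic approximation theory, not specific to this text), one may take the gain of each state to be the reciprocal of its visit count:

$$\gamma_k(x) = \frac{1}{n_k(x)},$$

where $n_k(x)$ is the number of visits to state $x$ up to step $k$: the first series then diverges like the harmonic series, while the series of squares converges.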
Then, the generalized TD algorithm updates the estimation of the cost according to

$$J_{k+1}(x) = J_k(x) + \gamma_k(x) \sum_{m=0}^{N-1} z_m^k(x)\, d_m^k.$$

It converges almost surely towards $J^\pi$. Therefore, the estimation is consistent.
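As an illustration, here is a minimal tabular sketch of one step $k$ of this update. It assumes a finite state space indexed by integers, a discount factor $\alpha$ appearing in the temporal differences, and hypothetical names (`generalized_td_update` and its arguments); it is a sketch under those assumptions, not the book's implementation.

```python
import numpy as np

def generalized_td_update(J, gamma, trajectory, costs, traces, alpha=0.95):
    """One step k of the generalized TD update (tabular sketch).

    J          -- current estimate J_k, one entry per state
    gamma      -- gains gamma_k(x), one entry per state
    trajectory -- states (x_0, ..., x_N) generated by the current policy
    costs      -- observed costs c(x_m, x_{m+1}), length N
    traces     -- eligibility traces, traces[m, x] = z_m^k(x)
    alpha      -- assumed discount factor
    """
    N = len(costs)
    # Temporal differences: d_m = c_m + alpha * J(x_{m+1}) - J(x_m)
    d = np.array([costs[m] + alpha * J[trajectory[m + 1]] - J[trajectory[m]]
                  for m in range(N)])
    # J_{k+1}(x) = J_k(x) + gamma_k(x) * sum_m z_m^k(x) * d_m
    return J + gamma * (traces[:N].T @ d)

# Tiny usage example: 3 states, one trajectory of 3 transitions.
J = np.zeros(3)
gamma = np.full(3, 0.5)
trajectory = [0, 1, 2, 1]
costs = [1.0, 0.0, 2.0]
# Traces frozen at 1 after the first visit -- one valid choice
# satisfying the two properties above.
traces = np.array([[1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [1.0, 1.0, 1.0]])
J = generalized_td_update(J, gamma, trajectory, costs, traces)
```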
For instance, the TD($\lambda$) algorithm, which was described previously, is a particular eligibility trace algorithm in which the trace decreases at the constant multiplicative rate $\alpha\lambda$.
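A minimal sketch of that particular trace, in the same tabular setting as above (using the replacing-trace variant, and the hypothetical name `td_lambda_traces`), could be:

```python
import numpy as np

def td_lambda_traces(trajectory, num_states, alpha=0.95, lam=0.9):
    """TD(lambda) eligibility traces: between visits, the trace of every
    state decays at the constant multiplicative rate alpha * lambda."""
    N = len(trajectory) - 1
    z = np.zeros((N, num_states))
    z[0, trajectory[0]] = 1.0          # z_0^k(x) = delta_{x_0}(x)
    for m in range(1, N):
        z[m] = alpha * lam * z[m - 1]  # decay at rate alpha * lambda
        z[m, trajectory[m]] = 1.0      # reset to 1 on a visit (replacing trace)
    return z

# These traces can be passed directly to generalized_td_update above.
traces = td_lambda_traces([0, 1, 2, 1], num_states=3)
```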
5.4.2.3 Back to the Actor-Critic Methodology and Optimistic Policy Iteration
In many real-life problems, the assessment of a given policy may be a question per se. The previous algorithms are designed for that purpose. However, we introduced the evaluation of a policy as a step in the computation loop aimed at determining an optimal policy.