at finding the optimal policy. The valuation algorithms are iterative, so that running them as part of an optimization loop may be too time-consuming. Therefore, it is natural to improve the current policy before the valuation process is complete. That improvement makes use of the partial results provided by one, or a small number, of iterations of the valuation algorithm.
In the subsection “Value-function Iteration method” of the previous section, we outlined a control design algorithm whereby the policy was updated according to an intermediate approximation of the cost function. That method is called the “Actor-Critic” method, or “Optimistic Iteration of the Policy”, because the current policy is computed on the basis of the current estimate of the cost function, which is optimistically assumed to be the optimal cost.
More precisely, as described above, the following steps are implemented at iteration $n$, given the cost function $J_n$.
$Q_n$ is computed as the value function associated with $J_n$,
$$Q_n(x,u) = \sum_{y \in E} p_u(x,y)\,\bigl[c(x,u,y) + \alpha J_n(y)\bigr].$$
The policy $\pi_n$ is defined as the solution of the minimization problem
$$\pi_n(x) = \operatorname*{Arg\,min}_{u \,/\, (x,u)\in A} Q_n(x,u).$$
One or several iterations of a temporal difference valuation algorithm are performed, using simulation results or experimental measurements, with policy $\pi_n$ as the current exploration policy. Thus, a new approximation $J_{n+1}$ is obtained (a sketch of the resulting loop is given below).
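The following sketch illustrates this actor-critic loop for a finite state space. The data layout (p[u][x, y] for the transition probabilities, c[u][x, y] for the transition costs), the TD(0) form of the critic update, the simulate interface, and the constant step size are illustrative assumptions, not choices made in the text.

import numpy as np

def greedy_policy(J, p, c, alpha):
    """Actor step: compute Q_n from J_n and return the cost-minimizing policy."""
    n_actions = len(p)
    Q = np.empty((J.shape[0], n_actions))
    for u in range(n_actions):
        # Q_n(x, u) = sum_y p_u(x, y) [ c(x, u, y) + alpha J_n(y) ]
        Q[:, u] = (p[u] * (c[u] + alpha * J)).sum(axis=1)
    return Q.argmin(axis=1)

def optimistic_policy_iteration(p, c, alpha, simulate, n_outer=100, n_td=10, step=0.1):
    """Alternate policy improvement with a few TD(0) valuation steps (the critic)."""
    J = np.zeros(p[0].shape[0])
    for _ in range(n_outer):
        pi = greedy_policy(J, p, c, alpha)        # pi_n from the current estimate J_n
        for _ in range(n_td):
            x, u, y, cost = simulate(pi)          # simulated or measured transition under pi_n
            # TD(0) correction of the cost estimate along the observed transition
            J[x] += step * (cost + alpha * J[y] - J[x])
    return greedy_policy(J, p, c, alpha), J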
5.4.3 Reinforcement Learning: Q-Learning Method
5.4.3.1 Description of the Q-Learning Algorithm
The algorithmic variants stated above use the value function $Q$ as the key ingredient in the determination of the optimal policy. Watkins and Dayan suggested an adaptive version of the value function iteration method in [Watkins et al. 1992]. It was called “Q-learning” by the authors, since it focuses on learning the value function $Q$. It quickly became one of the most popular reinforcement learning algorithms, especially for infinite-horizon problems.
The previous value function iteration algorithm consisted in the following: $Q_n$ was computed as the value function for the cost function $J_n$, and the cost function was updated by
$$J_{n+1}(x) = \min_{u \,/\, (x,u)\in A} Q_n(x,u).$$
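For concreteness, here is a minimal sketch of the standard Watkins and Dayan update, rewritten in the cost-minimization notation of this chapter; the environment interface simulate(x, u), the epsilon-greedy exploration, and the constant step size are illustrative assumptions rather than prescriptions from the text.

import numpy as np

def q_learning(simulate, n_states, n_actions, alpha,
               n_steps=10000, epsilon=0.1, step=0.1, seed=0):
    """Sketch of Watkins-Dayan Q-learning for a discounted cost-minimization problem."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    x = 0
    for _ in range(n_steps):
        # epsilon-greedy exploration around the current cost-minimizing action
        u = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[x].argmin())
        cost, y = simulate(x, u)                 # one observed transition and its cost
        # adaptive counterpart of Q(x, u) = E[ c(x, u, y) + alpha min_v Q(y, v) ]
        Q[x, u] += step * (cost + alpha * Q[y].min() - Q[x, u])
        x = y
    return Q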