at finding the optimal policy. The valuation algorithms are iterative, so that running them as part of an optimization loop may be too time-consuming. Therefore, it is natural to improve the current policy before the valuation process is complete. That improvement makes use of the partial results provided by one, or a small number, of iterations of the valuation algorithm.
In the subsection “Value-function Iteration method” of the previous section, we outlined a control design algorithm whereby the policy was updated according to an intermediate approximation of the cost function. That method is called the “Actor-Critic” method, or “Optimistic Iteration of the Policy”, because the current policy is computed on the basis of the current estimate of the cost function, which is optimistically assumed to be the optimal cost.
More precisely, as described above, the following steps are implemented at iteration $n$, given the cost function $J_n$.
$Q_n$ is computed as the value function associated with $J_n$,
$$Q_n(x,u) = \sum_{y \in E} p_u(x,y)\,\bigl[c(x,u,y) + \alpha J_n(y)\bigr].$$
The policy $\pi_n$ is defined as the solution of the minimization problem
$$\pi_n(x) = \operatorname*{Arg\,min}_{u \,/\, (x,u)\in A} Q_n(x,u).$$
One or several iterations of a temporal difference valuation algorithm are performed, using simulation results or experimental measurements, with policy $\pi_n$ as the current exploration policy. Thus, a new approximation $J_{n+1}$ is obtained (a sketch of the resulting loop is given below).
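The following sketch illustrates this actor-critic loop for a finite state space. The data layout (p[u][x, y] for the transition probabilities, c[u][x, y] for the transition costs), the TD(0) form of the critic update, the simulate interface, and the constant step size are illustrative assumptions, not choices made in the text.

import numpy as np

def greedy_policy(J, p, c, alpha):
    """Actor step: compute Q_n from J_n and return the cost-minimizing policy."""
    n_actions = len(p)
    Q = np.empty((J.shape[0], n_actions))
    for u in range(n_actions):
        # Q_n(x, u) = sum_y p_u(x, y) [ c(x, u, y) + alpha J_n(y) ]
        Q[:, u] = (p[u] * (c[u] + alpha * J)).sum(axis=1)
    return Q.argmin(axis=1)

def optimistic_policy_iteration(p, c, alpha, simulate, n_outer=100, n_td=10, step=0.1):
    """Alternate policy improvement with a few TD(0) valuation steps (the critic)."""
    J = np.zeros(p[0].shape[0])
    for _ in range(n_outer):
        pi = greedy_policy(J, p, c, alpha)        # pi_n from the current estimate J_n
        for _ in range(n_td):
            x, u, y, cost = simulate(pi)          # simulated or measured transition under pi_n
            # TD(0) correction of the cost estimate along the observed transition
            J[x] += step * (cost + alpha * J[y] - J[x])
    return greedy_policy(J, p, c, alpha), J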
5.4.3 Reinforcement Learning: Q-Learning Method
5.4.3.1 Description of the Q-Learning Algorithm
The algorithmic variants stated above use the value function $Q$ as the key ingredient in the determination of the optimal policy. Watkins and Dayan suggested an adaptive version of the value function iteration method in [Watkins et al. 1992]. It was called “Q-learning” by the authors, since it focuses on learning the value function $Q$. It quickly became one of the most popular reinforcement learning algorithms, especially for infinite-horizon problems.
The previous value function iteration algorithm consisted in the following: $Q_n$ was computed as the value function for the cost function $J_n$, and the cost function was updated by
$$J_{n+1}(x) = \min_{u \,/\, (x,u)\in A} Q_n(x,u).$$
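For concreteness, here is a minimal sketch of the standard Watkins and Dayan update, rewritten in the cost-minimization notation of this chapter; the environment interface simulate(x, u), the epsilon-greedy exploration, and the constant step size are illustrative assumptions rather than prescriptions from the text.

import numpy as np

def q_learning(simulate, n_states, n_actions, alpha,
               n_steps=10000, epsilon=0.1, step=0.1, seed=0):
    """Sketch of Watkins-Dayan Q-learning for a discounted cost-minimization problem."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    x = 0
    for _ in range(n_steps):
        # epsilon-greedy exploration around the current cost-minimizing action
        u = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[x].argmin())
        cost, y = simulate(x, u)                 # one observed transition and its cost
        # adaptive counterpart of Q(x, u) = E[ c(x, u, y) + alpha min_v Q(y, v) ]
        Q[x, u] += step * (cost + alpha * Q[y].min() - Q[x, u])
        x = y
    return Q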