In its adaptive version, a key change is performed:
The update of Q_n is performed from new experimental or simulation results, according to an exploration policy on the set of feasible state-action couples.
Thus a new algorithm is available. Its definition relies on a random exploration policy π, which associates with each state x a probability distribution on the set of actions u such that the state-action couple (x, u) is feasible. Thus, the exploration policy generates a Markov transition on the set of feasible state-action couples.
Actually, the theoretically mild assumption that each feasible couple is visited
infinitely often suffices to guarantee the convergence of the algorithm. We
will discuss below the practical consequences of the choice of the exploration
policy. Each transition according to the exploration policy is followed by the
implementation of the following Q-learning update rule:
$$Q_{k+1}(x,u) = Q_k(x,u) \qquad \text{if } (x,u) \neq \bigl(x_k,\, \pi_{k+1}(x_k)\bigr),$$
$$Q_{k+1}\bigl(x_k, \pi_{k+1}(x_k)\bigr) = (1 - \gamma_{k+1})\, Q_k\bigl(x_k, \pi_{k+1}(x_k)\bigr) + \gamma_{k+1}\Bigl[c\bigl(x_k, \pi_{k+1}(x_k), x_{k+1}\bigr) + \alpha J_k(x_{k+1})\Bigr],$$
$$J_{k+1}(x_{k+1}) = \min_{u \,/\, (x_{k+1},\,u) \in A} Q_{k+1}(x_{k+1}, u),$$
where $\pi_{k+1}(x_k)$ denotes the action drawn by the exploration policy in state $x_k$, and $A$ denotes the set of feasible state-action couples.
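To make the mechanics of this rule concrete, the following Python sketch performs one such transition for a tabular estimate of Q stored in a dictionary. The names used here (exploration_policy, feasible_actions, simulate, cost) are illustrative assumptions, not part of the original algorithm statement.

```python
# Minimal tabular sketch of one Q-learning transition (illustrative names).
def q_learning_step(Q, x, exploration_policy, feasible_actions,
                    simulate, cost, alpha, gamma):
    """Draw one transition under the exploration policy and update Q.

    Only the visited couple (x, u) is modified; every other entry of Q
    is left unchanged, as in the update rule above.
    """
    u = exploration_policy(x)      # feasible action drawn by the exploration policy
    x_next = simulate(x, u)        # experimental or simulated transition
    # J_k(x_next) = min over feasible actions v of Q_k(x_next, v)
    j_next = min(Q[(x_next, v)] for v in feasible_actions(x_next))
    # Q_{k+1}(x, u) = (1 - gamma_{k+1}) Q_k(x, u)
    #                 + gamma_{k+1} [ c(x, u, x_next) + alpha J_k(x_next) ]
    Q[(x, u)] = (1.0 - gamma) * Q[(x, u)] + gamma * (cost(x, u, x_next) + alpha * j_next)
    return x_next
```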
Q-Learning Convergence Theorem. The Q-learning algorithm converges to the value function Q associated with the optimal policy π*, provided that all the feasible state-action couples are visited infinitely often and the sequence of learning rates decreases according to the basic assumptions of stochastic approximation theory (for instance, a learning rate decreasing as the inverse of the number of visits).
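As an illustration of such a schedule (an assumption consistent with the theorem, not a prescription of the text), the learning rate can be taken as the inverse of the number of visits of each couple:

```python
from collections import defaultdict

# Learning rate decreasing with the number of visits of the couple (x, u):
# gamma_{k+1} = 1 / n(x, u), which satisfies the usual stochastic
# approximation conditions (divergent sum, convergent sum of squares).
visit_count = defaultdict(int)

def learning_rate(x, u):
    visit_count[(x, u)] += 1
    return 1.0 / visit_count[(x, u)]
```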
When the algorithm is stopped with a satisfactory estimate of the value function Q, the associated policy is readily determined as the solution of the minimization problem
$$\pi^*(x) = \arg\min_{u \,/\, (x,\,u) \in A} Q(x, u),$$
exactly as in the value function iteration algorithm. There is no necessary
connection between the exploration policy and the optimal policy. Of course,
a blind exploration policy is rather costly. In practice, one tries to track an
optimal policy by applying a sub-optimal policy. We will elaborate on that
point in the following section.
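One simple way of tracking the current estimate of the optimal policy while still visiting every feasible couple is an ε-greedy rule; the sketch below is an illustrative assumption, not necessarily the specific scheme discussed in the next section.

```python
import random

# Illustrative sub-optimal exploration policy: with probability epsilon,
# act blindly (random feasible action); otherwise follow the policy that is
# greedy with respect to the current estimate Q.
def make_epsilon_greedy(Q, feasible_actions, epsilon=0.1):
    def policy(x):
        actions = feasible_actions(x)
        if random.random() < epsilon:
            return random.choice(actions)            # blind exploration step
        return min(actions, key=lambda u: Q[(x, u)])  # greedy w.r.t. current Q
    return policy
```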
5.4.3.2 The Choice of an Exploration Policy
The problem of choosing an exploration policy is commonly addressed in sequential statistics [Thrun 1992]. If a lot of time is dedicated to exploration, using for instance a blind policy (random selection of a feasible action), the resulting estimation is accurate, but exploration is costly:
in computational time if a simulator is used,