in experimental costs if experiments are performed (destructive testing is
not a good solution).
However, if an optimistic exploration policy, which relies on the current estimate of the cost function, is followed, the estimate of the value function may be strongly biased. Thus, neither of the two extreme strategies, the blind exploration policy and the greedy (fully optimistic) policy, is appropriate.
Several mixed exploration schemes have been tested successfully. All of them interleave phases of greedy policy, applied most of the time, with phases of exploration policy, which make it possible to visit new state-action pairs and to check the convergence hypotheses of stochastic approximation (see above):
The iterative exploration-optimization scheme alternates iteration sequences of greedy optimistic policy and iteration sequences of blind exploration policy.
According to the randomized scheme, a random selection of the policy is performed independently at each step: blind with probability ε and greedy with probability 1 − ε (see the sketch after this list).
The annealed scheme is inspired by simulated annealing in combinatorial optimization (described in detail in Chap. 8). According to that scheme, a random policy is followed, in which the action is drawn from the Gibbs distribution
$$
P\bigl(\pi_k(x_k) = u\bigr) \;=\;
\frac{\exp\!\left(\dfrac{Q_k(x_k,u)}{T_k}\right)}
     {\displaystyle\sum_{u'\,:\,(x_k,u')\in A} \exp\!\left(\dfrac{Q_k(x_k,u')}{T_k}\right)}\,,
$$

where the temperature sequence $\{T_k\}$ obeys a cooling schedule. The cooling schedule must be tuned according to the problem of interest. Several cooling schedules are described in Chap. 8.
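As an illustration of the randomized and annealed selection rules above, here is a minimal sketch in Python, assuming a dictionary-based Q-table indexed by (state, action) pairs and the sign convention of the Gibbs formula above (larger Q preferred); the function names and the use of a numpy random generator are illustrative choices, not part of the text.

```python
import numpy as np

def epsilon_greedy_action(Q, x, actions, epsilon, rng):
    """Randomized scheme: blind (uniform) choice with probability epsilon,
    greedy choice with respect to the current Q estimate otherwise."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]      # blind exploration step
    q_values = np.array([Q[(x, u)] for u in actions])
    return actions[int(np.argmax(q_values))]            # greedy step

def gibbs_action(Q, x, actions, T, rng):
    """Annealed scheme: draw the action from the Gibbs distribution
    at temperature T, as in the formula above."""
    q_values = np.array([Q[(x, u)] for u in actions], dtype=float)
    logits = q_values / T
    logits -= logits.max()                              # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]
```

A cooling schedule simply amounts to supplying a decreasing temperature $T_k$ at each call of gibbs_action; as noted above, the particular schedule must be tuned to the problem at hand.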
5.4.3.3 Application of Q-Learning to Partially Observed Problems
It is easy to implement the Q-learning algorithm for partially observed Markov decision problems if the selection is performed among feasible policies that depend solely on the observations and not on the unknown state. That approach is commonly implemented, especially in autonomous robotics. In that field, perception is limited: it is provided by sensors and does not allow the current state to be determined exactly. Q-learning may then provide satisfactory sub-optimal policies, but success is not guaranteed; in that case, Q-learning is just a heuristic. Its success relies on the ability of the sensors to capture the key features of the environment with respect to a feasible optimal policy, whenever such a policy exists (as a deterministic policy). It has been shown [Singh et al. 1995] that the limit of the Q-learning algorithm (which stabilizes under common mild hypotheses) depends on the exploration policy, and is generally not the optimal policy of the partially observed problem.
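For concreteness, here is a minimal sketch of such an implementation: tabular Q-learning whose table is indexed by observations instead of by the unknown states. The environment interface (reset, step returning an observation, a cost and a termination flag), the ε-greedy exploration and all parameter values are illustrative assumptions, not prescriptions of the text; costs are minimized here, so the greedy choice is a min.

```python
import random
from collections import defaultdict

def q_learning_on_observations(env, actions, episodes=500,
                               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning driven by observations only: the hidden state is
    never accessed, and the table is indexed by (observation, action) pairs.
    'env' is a hypothetical interface with reset() -> observation and
    step(u) -> (next_observation, cost, done)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            # randomized exploration scheme: blind with probability epsilon
            if random.random() < epsilon:
                u = random.choice(actions)
            else:
                u = min(actions, key=lambda a: Q[(obs, a)])  # greedy w.r.t. estimated cost
            next_obs, cost, done = env.step(u)
            # temporal-difference update written on observations, not states
            future = 0.0 if done else gamma * min(Q[(next_obs, a)] for a in actions)
            Q[(obs, u)] += alpha * (cost + future - Q[(obs, u)])
            obs = next_obs
    return Q
```

Whether the resulting policy is satisfactory depends, as stated above, on how well the observations capture the features of the environment that matter for a good policy.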
 