For model-free RL, this means modelling the action-value function estimates by probability distributions for each state/action pair. Unfortunately, this approach is not analytically tractable, as the distributions are strongly correlated due to the state transitions. This leads to complex posterior distributions that cannot be expressed analytically. A workaround is to use various assumptions and approximations that make the method less accurate but analytically and computationally tractable. This workaround was used to develop Bayesian Q-Learning [68], which, amongst other things, assumes the independence of all action-value function estimates, and uses an action selection scheme that maximises the information gain. Its performance improvement over methods based on confidence intervals is noticeable but moderate.
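To make the independence assumption concrete, the following sketch maintains a separate Normal-Gamma posterior over Q(s, a) for each state/action pair. It is only an illustration of the general idea, not the exact algorithm of [68]: the update uses the posterior mean of the best successor action as a point estimate of the one-step return, and exploration is driven by posterior (Thompson-style) sampling rather than the information-gain criterion. All class and method names are illustrative.

```python
import numpy as np

# Sketch of Bayesian Q-learning with independent Normal-Gamma posteriors
# over Q(s, a).  The successor value is approximated by a point estimate,
# and exploration uses posterior sampling instead of the VPI criterion of
# [68]; both are simplifying assumptions made for brevity.

class NormalGammaQ:
    def __init__(self, n_states, n_actions, gamma=0.95,
                 mu0=0.0, lam0=1.0, alpha0=2.0, beta0=1.0):
        self.gamma = gamma
        shape = (n_states, n_actions)
        self.mu = np.full(shape, mu0)        # posterior mean of Q(s, a)
        self.lam = np.full(shape, lam0)      # pseudo-counts for the mean
        self.alpha = np.full(shape, alpha0)  # Gamma shape (precision)
        self.beta = np.full(shape, beta0)    # Gamma rate (precision)

    def sample_q(self, s):
        """Draw one plausible Q-vector for state s from the posterior."""
        tau = np.random.gamma(self.alpha[s], 1.0 / self.beta[s])
        return np.random.normal(self.mu[s], 1.0 / np.sqrt(self.lam[s] * tau))

    def select_action(self, s):
        # Thompson-style exploration: act greedily w.r.t. a posterior sample.
        return int(np.argmax(self.sample_q(s)))

    def update(self, s, a, r, s_next, done):
        # One-step return sample, using the posterior mean of the best
        # successor action as a point estimate (an approximation).
        x = r if done else r + self.gamma * self.mu[s_next].max()
        mu, lam = self.mu[s, a], self.lam[s, a]
        # Standard Normal-Gamma conjugate update for a single observation x.
        self.mu[s, a] = (lam * mu + x) / (lam + 1.0)
        self.lam[s, a] = lam + 1.0
        self.alpha[s, a] += 0.5
        self.beta[s, a] += 0.5 * lam * (x - mu) ** 2 / (lam + 1.0)
```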
Bayesian model-based RL is more popular as it provides cleaner implementations. It is based on modelling the transition and reward function estimates by probability distributions that are updated with new information. This results in a problem that can be cast as a POMDP, and can be solved with the same methods [84]. Unfortunately, this implies that it comes with the same complexity, which makes it unsuitable for application to large problems. Nonetheless, some implementations have been devised (for example, [185]), and research in Bayesian RL is still very active. It is to be hoped that its complexity can be reduced by the use of approximations without losing too much accuracy, while maintaining the full distributions that are the advantage of the Bayesian approach.
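As an illustration of the belief maintenance in Bayesian model-based RL, the following sketch keeps Dirichlet posteriors over the transition probabilities and Normal posteriors (with an assumed known observation noise) over the mean rewards of a discrete MDP. Both updates are available in closed form; the costly part, planning over such beliefs as a POMDP [84], is deliberately omitted. All names and hyperparameters are illustrative.

```python
import numpy as np

# Sketch of the belief update in Bayesian model-based RL for a discrete MDP.
# Transitions get Dirichlet posteriors, mean rewards get Normal posteriors
# with known noise variance; planning in the resulting belief space is the
# intractable part and is not shown here.

class MDPBelief:
    def __init__(self, n_states, n_actions,
                 dir_prior=1.0, r_mean0=0.0, r_var0=1.0, r_noise=1.0):
        self.trans_counts = np.full((n_states, n_actions, n_states), dir_prior)
        self.r_mean = np.full((n_states, n_actions), r_mean0)
        self.r_var = np.full((n_states, n_actions), r_var0)
        self.r_noise = r_noise  # assumed known reward noise variance

    def update(self, s, a, r, s_next):
        # Dirichlet posterior: increment the count of the observed transition.
        self.trans_counts[s, a, s_next] += 1.0
        # Conjugate Normal update of the mean reward for (s, a).
        var, mean = self.r_var[s, a], self.r_mean[s, a]
        k = var / (var + self.r_noise)
        self.r_mean[s, a] = mean + k * (r - mean)
        self.r_var[s, a] = (1.0 - k) * var

    def expected_transitions(self, s, a):
        # Posterior mean of the transition distribution P(. | s, a).
        counts = self.trans_counts[s, a]
        return counts / counts.sum()

    def sample_mdp(self):
        # Draw one plausible MDP (P, R) from the posterior, as used e.g. by
        # posterior-sampling approaches to exploration.
        P = np.apply_along_axis(np.random.dirichlet, -1, self.trans_counts)
        R = np.random.normal(self.r_mean, np.sqrt(self.r_var))
        return P, R
```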
So far, the only form of Bayesian RL that has been used with LCS is Bayesian Q-Learning, by using Bayesian classifier models within a standard XCS(F), with the result of more effective and stable action selection when compared to XCS(F) [1]. This approach could be extended to use the full Bayesian model that was introduced here, once an incremental implementation is available. The use of model-based Bayesian RL requires anticipatory LCS, but its immediate contribution is questionable due to the high complexity of the RL method itself.
9.6 Summary
Despite sequential decision tasks being the prime motivator for LCS, they are still the ones which LCS handle least successfully. This chapter provides a primer on how to use dynamic programming and reinforcement learning to handle such tasks, and on how LCS can be combined with either approach from first principles. Also, some important issues regarding such combinations, such as stability, long path learning, and the exploration/exploitation dilemma, were discussed.
An essential part of the LCS type discussed here is that classifiers are trained independently. This is not completely true when using LCS with reinforcement learning, as the target values that the classifiers are trained on are based on the global prediction, which is formed by all matching classifiers in combination. In that sense, classifiers interact when forming their action-value function estimates. Still, besides combining classifier predictions to form the target values, independent classifier training still forms the basis of this model type, even when used in combination with RL. Thus, the update equations