If the state space is large or even continuous, the value function is not learned for each state separately but is instead modelled by some function approximation technique. However, this limits the quality of the discovered policy by how closely the approximated value function matches the real value function. Furthermore, the shape of the value function is not known beforehand, so the function approximation technique has to be able to adjust its resources adaptively. Given that LCS provide such adaptive regression models, they seem to be a prime candidate for approximating the value function of RL methods; and this is in fact exactly what LCS are used for when applied to sequential decision tasks: they act as adaptive value function approximation methods that aid learning the value function within RL methods.
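To make the idea concrete (standard notation, given here only as an illustration rather than taken from this chapter), the exact value function $V$ is replaced by a parametrised approximation $\tilde{V}_\theta(s) \approx V(s)$, and learning adjusts the parameters $\theta$ from observed transitions; in an LCS, $\tilde{V}_\theta$ roughly corresponds to the global prediction formed by combining the local models of the matching classifiers.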
Because early LCS pre-date common RL methods, they have not always been characterised as approximating the value function. In fact, the first comparison between RL and LCS was performed by Dorigo and Bersini [74], who showed that a Very Simple CS without generalisation, using a slightly modified implicit bucket brigade, is equivalent to tabular Q-Learning. A more general study shows how evolutionary computation can be used for reinforcement learning [172], but it ignores the development of XCS [237], in which Wilson explicitly uses Q-Learning as the RL component.
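For reference, tabular Q-Learning (standard formulation, restated here purely for illustration) maintains one estimate $\hat{Q}(s, a)$ per state/action pair and, after taking action $a$ in state $s$ and observing reward $r$ and next state $s'$, applies the update
$$\hat{Q}(s, a) \leftarrow \hat{Q}(s, a) + \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a') - \hat{Q}(s, a) \right),$$
where $\alpha$ is the learning rate and $\gamma$ the discount factor.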
Recently, there has been some confusion [47, 223, 142] about how to correctly implement RL in XCS(F), which has caused XCS(F) to be modified in various ways. Using the model-based stance, variants of Q-Learning that use LCS function approximation will be derived from first principles, showing that XCS(F) already performs correct RL without the need for modifications. This also demonstrates how to correctly combine RL methods and LCS function approximation in general, illustrated for the LCS model type that was introduced in the previous chapters.
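As a sketch of the principle (the standard semi-gradient form; the derivation in Sect. 9.3 may differ in its details), replacing the table by a parametrised approximation $\hat{Q}_\theta(s, a)$ turns the above update into
$$\theta \leftarrow \theta + \alpha \left( r + \gamma \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a) \right) \nabla_\theta \hat{Q}_\theta(s, a),$$
so that, roughly speaking, each matching classifier in an LCS adjusts its local model towards the same temporal-difference target.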
Using RL with any form of value function approximation might cause the RL method to become unstable and possibly diverge. Only certain forms of function approximation can be used with RL - an issue that will be discussed in detail in a later section, where the compatibility of the introduced LCS model with RL is analysed. Besides stability, learning policies that require a long sequence of actions is another issue that needs special consideration, as function approximation might cause the learned policy to become sub-optimal. This issue will be discussed together with the exploration/exploitation dilemma, which concerns the trade-off between exploiting current knowledge to form the policy and performing further exploration to gain more knowledge.
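A standard example of handling this trade-off (the $\epsilon$-greedy scheme, mentioned here only as an illustration) is to select the action $\arg\max_a \hat{Q}(s, a)$ with probability $1 - \epsilon$ and a uniformly random action with probability $\epsilon$, so that the current value estimates are mostly exploited while some exploration is retained.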
Appropriately linking LCS into RL firstly requires a formal basis for RL, which is formed by various DP methods. Their introduction is kept brief, and a longer LCS-related version is available as a technical report [79]. Nonetheless, we discuss some stability issues that RL is known to have when the value function is approximated, as these are particularly relevant - though mostly ignored - when combining RL with LCS. Hence, after showing how to derive the use of Q-Learning with LCS from first principles in Sect. 9.3 and discussing the recent confusion around XCS(F), Sect. 9.4 shows how to analyse the stability of RL when combined with LCS function approximation.