only use LCS with RL if the LCS implementation can handle incremental model
parameter updates. Additionally, while approximating the action-value function
is a simple univariate regression task, the target function to approximate is non-stationary because it is itself updated sequentially. Thus, in addition to incremental learning, the LCS implementation needs to be able to handle non-stationary target functions.
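As a concrete illustration of the kind of update this calls for, the sketch below performs a single least mean squares (LMS) step on a linear model. With a step size that stays bounded away from zero, the estimate keeps tracking a drifting target rather than converging to a fixed average; the function name and step size are illustrative assumptions, not values prescribed by the text.

import numpy as np

def lms_step(w, x, target, step_size=0.05):
    """One incremental LMS step of a linear model w towards a target value.

    A constant step size lets the estimate follow a non-stationary
    (sequentially updated) target; a step size decaying to zero would
    eventually stop adapting.
    """
    error = target - w @ x             # prediction error for this observation
    return w + step_size * error * x   # gradient step on the squared error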
This section demonstrates how to derive Q-Learning with the LCS model as
introduced in Chap. 4, to act as a template for how any LCS model type can
be used for RL. Some of the introduced principles and derivations are specific
to the LCS model with independently trained classifiers, but the underlying
ideas also transfer to other LCS model types. The derivations themselves are
performed from first principles to make explicit the usually implicit design deci-
sions. Concentrating purely on incremental model parameter learning, the model structure $\mathcal{M}$ is assumed to be constant. In particular, the derivation focuses on
the classifier parameter updates, as these are the most relevant with respect to
RL.
Even though the Bayesian update equations from Chap. 7 protect against overfitting, this section falls back to the maximum likelihood learning that formed the basis for the incremental methods described in Chaps. 5 and 6. The resulting
update equations conform exactly to XCS(F), which reveals its design principles
and should clarify some of the recent confusion about how to implement gradi-
ent descent in XCS(F). An additional bonus is a more accurate noise precision
update method for XCS(F) based on the methods developed in Chap. 5.
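For orientation, the sketch below shows a generic recursive least squares (RLS) step with a forgetting factor, the family of weight updates on which the RLS-based derivation builds; it is a textbook form under assumed variable names, not the exact matching-weighted XCS(F) equations derived later in this section.

import numpy as np

def rls_step(w, P, x, target, lam=0.99):
    """One recursive least squares step with forgetting factor lam < 1.

    w -- current weight vector of the linear model
    P -- running inverse of the exponentially discounted input correlation matrix
    Discounting old observations by lam allows the estimate to follow a
    non-stationary target function.
    """
    Px = P @ x
    gain = Px / (lam + x @ Px)               # gain vector for this observation
    error = target - w @ x                   # prediction error before the update
    w_new = w + gain * error                 # weight update
    P_new = (P - np.outer(gain, Px)) / lam   # update of the inverse correlation matrix
    return w_new, P_new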
Firstly, the LCS approximation operator that conforms to the LCS model type of this work is introduced. This is followed by a discussion of how the principle of independent classifier training relates to how DP and RL update the value and action-value function estimates, which is essential for using this LCS model type to perform RL. As Q-Learning is based on asynchronous value iteration, it is first shown how LCS can perform asynchronous value iteration, followed by the derivation of two Q-Learning variants, one based on LMS and the other on RLS. Finally, these derivations are related to the recent controversy about how XCS(F) correctly performs Q-Learning with gradient descent.
9.3.1 Approximating the Value Function
Given a value vector $V$, LCS approximates it by a set of $K$ localised models $\{\hat{V}_k\}$ that are combined to form a global model $\hat{V}$. The localised models are provided by the classifiers, and the mixing model is used to combine these to the global model.
Each classifier $k$ matches a subset of the state space that is determined by its matching function $m_k$, which returns for each state $x$ the probability $m_k(x)$ of matching it. Let us for now assume that we approximate the value function $V$ rather than the action-value function $Q$. Then, classifier $k$ provides the probabilistic model $p(V \mid x, \theta_k)$ that gives the probability of the expected return of state $x$ having the value $V$. Assuming linear classifiers (5.3), this model is given by
$$
p(V \mid x, \theta_k) = \mathcal{N}\!\left(V \mid w_k^{\top} x, \tau_k^{-1}\right), \qquad (9.19)
$$
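To make concrete how these local models yield a prediction, the sketch below combines the classifiers' linear predictions $w_k^{\top} x$ into a global value estimate. It uses plain matching-weighted averaging as a stand-in for the mixing model; the actual mixing models of this work are developed in Chap. 6, so the normalisation used here is an illustrative assumption only.

import numpy as np

def global_value_estimate(x, matching_funcs, weights):
    """Combine local linear predictions w_k^T x into a global estimate of V(x).

    x              -- feature vector of the state (including any bias element)
    matching_funcs -- matching functions m_k, each returning m_k(x) in [0, 1]
    weights        -- weight vectors w_k, one per classifier
    """
    m = np.array([m_k(x) for m_k in matching_funcs])  # matching probabilities m_k(x)
    local = np.array([w_k @ x for w_k in weights])    # local predictions w_k^T x
    if m.sum() == 0.0:
        return 0.0                                    # no classifier matches this state
    g = m / m.sum()                                   # normalised mixing weights g_k(x)
    return float(g @ local)                           # weighted sum over classifiers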
 