squared errors, denoted by $s_{k,t}$ for classifier $k$ at time $t$, and can then compute the noise precision by (5.63). By (5.69), the sum of squared errors is updated by
$$
s_{k,t+1} = \lambda^{m_k(\mathbf{x}_t, a_t)} s_{k,t} + m_k(\mathbf{x}_t, a_t)\left(\mathbf{w}_{k,t}^\top \mathbf{x}_t - Q_{t+1}(\mathbf{x}_t, a_t)\right)\left(\mathbf{w}_{k,t+1}^\top \mathbf{x}_t - Q_{t+1}(\mathbf{x}_t, a_t)\right),
\qquad (9.32)
$$
starting with $s_{k,0} = 0$.
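As an illustration, a minimal Python sketch of the update (9.32) for a single classifier is given below; the names (update_squared_error, lam, match, w_old, w_new, q_target) are chosen for this sketch only, and the subsequent computation of the noise precision by (5.63) is omitted.

import numpy as np

def update_squared_error(s_k, lam, match, w_old, w_new, x, q_target):
    # s_k:      current sum of squared errors s_{k,t}
    # lam:      recency-weighting factor lambda
    # match:    matching degree m_k(x_t, a_t)
    # w_old:    weight vector w_{k,t} before the RLS weight update
    # w_new:    weight vector w_{k,t+1} after the RLS weight update
    # x:        input vector x_t
    # q_target: target value Q_{t+1}(x_t, a_t)
    e_old = w_old @ x - q_target  # prediction error before the weight update
    e_new = w_new @ x - q_target  # prediction error after the weight update
    return (lam ** match) * s_k + match * e_old * e_new

# Example: a single update, starting from s_{k,0} = 0
x = np.array([1.0, 0.4])
s_k = update_squared_error(0.0, lam=0.99, match=1.0,
                           w_old=np.array([0.2, 0.1]),
                           w_new=np.array([0.3, 0.15]),
                           x=x, q_target=0.5)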
Even though Lanzi et al. [142, 143] have already used XCS with RLS, they have never applied it to sequential decision tasks. We have already investigated the incremental noise precision update as given in this chapter for simple regression tasks [155], but its empirical evaluation on sequential decision tasks is still pending.
9.3.6 XCS with Gradient Descent
Some recent work by Butz et al. [47, 45] has caused confusion about how XCS performs Q-Learning, and how this can be enhanced by the use of gradient descent [223, 224, 142, 140, 139]. This section clarifies that XCS(F) in its current form already performs gradient descent and does not need to be modified, as it updates the classifiers' weight vectors by (9.28), which is a gradient-based update algorithm. As XCSF is (besides the MAM update) equivalent to XCS if $D_\mathcal{X} = 1$, the following discussion only considers XCSF.
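To see why (9.28) is gradient-based, recall that it is an NLMS-style (modified delta rule) update; written in that generic form (a sketch, as the exact constants of (9.28) are not repeated here),
$$
\mathbf{w}_{k,t+1} = \mathbf{w}_{k,t} + \frac{\alpha\, m_k(\mathbf{x}_t, a_t)}{\|\mathbf{x}_t\|^2}\left(Q_{t+1}(\mathbf{x}_t, a_t) - \mathbf{w}_{k,t}^\top \mathbf{x}_t\right)\mathbf{x}_t,
$$
it is a step of size $\alpha/\|\mathbf{x}_t\|^2$ along the negative gradient of the matched squared prediction error $\frac{1}{2} m_k(\mathbf{x}_t, a_t)\bigl(Q_{t+1}(\mathbf{x}_t, a_t) - \mathbf{w}^\top \mathbf{x}_t\bigr)^2$ with respect to $\mathbf{w}$, evaluated at $\mathbf{w}_{k,t}$.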
To show the equivalence between XCSF and (9.28), let us have a closer look at the weight vector update of XCSF: upon arriving at state $\mathbf{x}_t$, XCSF forms a match set that contains all classifiers for which $m_k(\mathbf{x}_t, a) > 0$, independent of the action $a$. The match set is then partitioned into one subset per possible action, resulting in $|\mathcal{A}|$ subsets. The subset associated with action $a$ contains all classifiers for which $m_k(\mathbf{x}_t, a) > 0$, and for each of these subsets the action-value estimate $Q_t(\mathbf{x}_t, a) = \sum_k g_k(\mathbf{x}_t, a) Q_{k,t}(\mathbf{x}_t, a)$ is calculated, resulting in the prediction vector $(Q_t(\mathbf{x}_t, a_1), \dots, Q_t(\mathbf{x}_t, a_{|\mathcal{A}|}))$ that predicts the expected return for the current state $\mathbf{x}_t$ and each possible action that can be performed. Based on this prediction vector, an action $a_t$ is chosen and performed, leading to the next state $\mathbf{x}_{t+1}$ and reward $r_{\mathbf{x}_t \mathbf{x}_{t+1}}(a_t)$. The subset of the match set that promoted the chosen action becomes the action set, which contains all classifiers such that $m_k(\mathbf{x}_t, a_t) > 0$. At the same time, a new prediction vector $(Q_t(\mathbf{x}_{t+1}, a_1), \dots, Q_t(\mathbf{x}_{t+1}, a_{|\mathcal{A}|}))$ for state $\mathbf{x}_{t+1}$ is formed, and its largest element is chosen, giving $\max_{a \in \mathcal{A}} Q_t(\mathbf{x}_{t+1}, a)$. Then, all classifiers in the action set are updated by the modified delta rule (which is equivalent to the NLMS algorithm) with the target value $r_{\mathbf{x}_t \mathbf{x}_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} Q_t(\mathbf{x}_{t+1}, a)$. The update in (9.28) uses exactly this target value, as given by (9.26), and updates the weight vector of each classifier for which $m_k(\mathbf{x}_t, a_t) > 0$, which are the classifiers in the action set. This shows that (9.28) describes the weight vector update as it is performed in XCSF, and therefore XCS(F) performs gradient descent without any additional modification.
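To tie these steps together, the following minimal Python sketch performs one such XCSF step under stated assumptions: the NLMS form of the classifier update is used, the mixing weights $g_k$ are reduced to matching-normalised weights without fitness, the condition part of matching is simplified so that every classifier matches every state for its own action, and all names (Classifier, prediction_vector, xcsf_q_learning_step, alpha, gamma) are illustrative rather than taken from any particular XCSF implementation.

import numpy as np

class Classifier:
    def __init__(self, w, action):
        self.w = w            # weight vector w_k of the linear model
        self.action = action  # action advocated by the classifier

    def match(self, x, a):
        # matching function m_k(x, a); simplified to match every state
        return 1.0 if a == self.action else 0.0

    def predict(self, x):
        return self.w @ x     # classifier prediction w_k^T x

def prediction_vector(classifiers, x, actions):
    # Q_t(x, a) = sum_k g_k(x, a) Q_{k,t}(x, a), with g_k here a
    # matching-normalised mixing weight (fitness weighting omitted)
    q = {}
    for a in actions:
        matched = [(c.match(x, a), c.predict(x)) for c in classifiers]
        total = sum(m for m, _ in matched)
        q[a] = sum(m * p for m, p in matched) / total if total > 0 else 0.0
    return q

def xcsf_q_learning_step(classifiers, x_t, a_t, r, x_next, actions,
                         alpha=0.2, gamma=0.9):
    # Q-learning target value: r + gamma * max_a Q_t(x_{t+1}, a)
    q_next = prediction_vector(classifiers, x_next, actions)
    target = r + gamma * max(q_next.values())
    # Update all classifiers in the action set, i.e. those with
    # m_k(x_t, a_t) > 0, by the modified delta rule (NLMS)
    for c in classifiers:
        m = c.match(x_t, a_t)
        if m > 0.0:
            error = target - c.predict(x_t)
            c.w = c.w + alpha * m * error * x_t / (x_t @ x_t)

# Example: two classifiers, one per action, over a 2-dimensional state;
# action selection from the prediction vector is assumed to happen elsewhere
actions = [0, 1]
cls = [Classifier(np.array([0.1, 0.0]), 0), Classifier(np.array([0.0, 0.2]), 1)]
xcsf_q_learning_step(cls, x_t=np.array([1.0, 0.5]), a_t=0, r=1.0,
                     x_next=np.array([0.8, 0.6]), actions=actions)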
 