on $V_{t+1}$ and its matching function, as described in Sect. 5.2. This completes one
iteration of LCS approximate value iteration.
The only modification introduced by the asynchronous variant is that rather
than updating the value function for all states at once, a single state is picked per
iteration, and the value function is updated for this state only, as already described in
Sect. 9.2.2. Let $\{x_{i_1}, x_{i_2}, \dots\}$ be the sequence of states that determines which state
is updated at which iteration. Thus, in the $t$-th iteration we compute $V_t(x_{i_t})$ by
(9.24), which results in the sequence $\{V_1(x_{i_1}), V_2(x_{i_2}), \dots\}$ that can be used to
incrementally train the classifiers by a method of choice from Sect. 5.3. For the
asynchronous variant we cannot use batch learning anymore, as not all elements
of $V_{t+1}$ are available at once.
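As a rough illustration, the following sketch performs one iteration of the asynchronous variant for a single state: it forms a standard value iteration target for that state from the global (mixed) value estimate and then lets every matching classifier take an incremental LMS-style step towards it. The classifier interface (`matches`, `features`, `w`, `step_size`), the matching-based mixing, and the dictionary layout of `transitions` and `rewards` are illustrative assumptions, not the book's notation.

```python
def mixed_value(x, classifiers):
    """Global value estimate for state x as a mixture of the matching
    classifiers' local linear estimates (mixing weights assumed to be
    proportional to matching and normalised over classifiers)."""
    num, den = 0.0, 0.0
    for cl in classifiers:
        g = cl.matches(x)                        # stands in for the mixing weight g_k(x)
        num += g * float(cl.w @ cl.features(x))  # local estimate of V(x)
        den += g
    return num / den if den > 0.0 else 0.0


def async_value_iteration_step(x, classifiers, transitions, rewards, gamma):
    """One iteration of the asynchronous variant for the single state x:
    compute a value iteration target for x and train every matching
    classifier incrementally (one LMS step) on that target."""
    # Standard value iteration target: maximise over actions the expected
    # reward plus the discounted mixed value estimate of the successor.
    target = max(
        sum(p * (rewards[x][a][x2] + gamma * mixed_value(x2, classifiers))
            for x2, p in transitions[x][a].items())
        for a in transitions[x]
    )
    # Incremental update of every classifier that matches x.
    for cl in classifiers:
        m = cl.matches(x)                        # matching function m_k(x)
        if m > 0.0:
            phi = cl.features(x)                 # feature vector for state x
            err = target - float(cl.w @ phi)     # prediction error
            cl.w += cl.step_size * m * err * phi
    return target
```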
9.3.4 Q-Learning by Least Mean Squares
So far it was assumed that the transition and reward functions of the given
problem are known. To perform model-free RL, the LCS model needs to be
adjusted to handle action-value function estimates rather than value function
estimates, by redefining the input space to be the space of all state/action pairs.
Thus, given state $x$ and action $a$, the matching probability of classifier $k$ is given
by $m_k(x, a)$, and the approximation of its action-value by $Q_k(x, a)$. Mixing is
also based on state and action, where the mixing coefficient for classifier $k$ is
given by $g_k(x, a)$. This results in the global approximation of the action-value
for state $x$ and action $a$ being given by

$$Q(x, a) = \sum_k g_k(x, a) \, Q_k(x, a). \qquad (9.25)$$
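A minimal sketch of (9.25) might look as follows: the global action-value estimate mixes the local estimates of all matching classifiers. The method names `mixing_weight` and `predict` are assumptions for illustration; the weights are normalised here so that they sum to one over classifiers.

```python
def global_action_value(x, a, classifiers):
    """Global action-value estimate Q(x, a) as in (9.25): a mixture of the
    local estimates Q_k(x, a) of the matching classifiers, weighted by
    mixing coefficients g_k(x, a) normalised over classifiers."""
    raw = [(cl.mixing_weight(x, a), cl.predict(x, a)) for cl in classifiers]
    total = sum(g for g, _ in raw)
    if total == 0.0:                    # no classifier matches (x, a)
        return 0.0
    return sum((g / total) * q for g, q in raw)
```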
As described in Sect. 9.2.7, approximate Q-Learning is based on minimising
(9.18). For independently trained classifiers, each classifier minimises this cost
independently, but with the value function estimate $V$ of the next state based
on the global estimate. Thus the target for $Q_k$ for the transition from $x_t$ to $x_{t+1}$
under action $a_t$ is

$$Q_{t+1}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} Q_t(x_{t+1}, a), \qquad (9.26)$$
given that classifier $k$ matches $(x_t, a_t)$. Using the linear classifier model $Q_k(x, a) = w_k^T x$,
each classifier $k$ aims at finding the $w_k$ that, by (9.18) after $t$ steps, minimises

$$\sum_{m=0}^{t} m_k(x_m, a_m) \left( w_k^T x_m - Q_{m+1}(x_m, a_m) \right)^2, \qquad (9.27)$$

where $m_k(x_m, a_m)$ was introduced to only consider matched state/action pairs.
This standard linear least squares problem can be handled incrementally with
any of the methods discussed in Chap. 5. It cannot be trained by batch learning
as the target function relies on previous updates and is thus non-stationary.
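Putting (9.26) and (9.27) together, a sketch of one model-free Q-Learning step by least mean squares could look as follows: the target is formed from the observed reward and the global estimate of the next state (reusing `global_action_value` from the sketch above), and every classifier that matches $(x, a)$ takes a single LMS gradient step towards it. The classifier interface and the `step_size` parameter are assumptions for illustration.

```python
def q_learning_lms_step(x, a, r, x_next, actions, classifiers, gamma):
    """One Q-Learning update by least mean squares: form the target (9.26)
    from the global action-value estimate of the next state, then let each
    matching classifier take an LMS gradient step on the cost (9.27)."""
    # Target: observed reward plus discounted greedy value of the successor,
    # evaluated with the global (mixed) action-value estimate.
    target = r + gamma * max(global_action_value(x_next, a2, classifiers)
                             for a2 in actions)
    # LMS step for every classifier that matches the state/action pair.
    for cl in classifiers:
        m = cl.matches(x, a)                     # m_k(x, a)
        if m > 0.0:
            phi = cl.features(x, a)              # input vector for (x, a)
            err = target - float(cl.w @ phi)     # prediction error
            cl.w += cl.step_size * m * err * phi
```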