For a sequence of states \{x_1, x_2, \ldots\} and actions \{a_1, a_2, \ldots\}, the Q-values are updated by

    Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t \left( r_{x_t x_{t+1}}(a_t) + \gamma \max_{a \in \mathcal{A}} Q_t(x_{t+1}, a) - Q_t(x_t, a_t) \right),    (9.17)

where \alpha_t denotes the step size at time t. As before, the explore/exploit dilemma applies when selecting actions based on the current Q.
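To make (9.17) concrete, the following is a minimal sketch of tabular Q-Learning with ε-greedy action selection. The environment interface (env.reset, env.step), the constant step size alpha in place of \alpha_t, and the defaultdict table are illustrative assumptions, not part of the text.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning with epsilon-greedy action selection, following (9.17)."""
    Q = defaultdict(float)  # Q[(state, action)], implicitly initialised to 0

    def greedy(x):
        return max(actions, key=lambda a: Q[(x, a)])

    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # explore/exploit: random action with probability epsilon, else greedy
            a = random.choice(actions) if random.random() < epsilon else greedy(x)
            x_next, r, done = env.step(a)
            # update (9.17): bootstrap on the greedy value of the next state
            target = r + (0.0 if done else gamma * max(Q[(x_next, b)] for b in actions))
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = x_next
    return Q
```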
A variant of Q-Learning, called Q(λ), extends it with eligibility traces in the manner of TD(λ) for as long as it performs on-policy actions [229]. As soon as an off-policy action is chosen, all traces are reset to zero, since the off-policy action breaks the temporal sequence of predictions. Hence, the performance gain from using traces depends significantly on the policy that is used, and is usually marginal. A study by Drugowitsch and Barry [77] showed that, when used in XCS, it even performs worse than standard Q-Learning.
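For illustration only, the trace handling described above (decay while acting on-policy, reset on an off-policy action) could be sketched as follows; Q and E are assumed to be defaultdict(float) tables, and all names and constants are hypothetical.

```python
import random

def q_lambda_step(Q, E, x, a, r, x_next, done, actions,
                  alpha=0.1, gamma=0.9, lam=0.9, epsilon=0.1):
    """One Q(lambda) step: traces decay while the greedy action is followed,
    and are reset to zero as soon as an off-policy action is chosen."""
    a_greedy = max(actions, key=lambda b: Q[(x_next, b)])
    # the next action is chosen epsilon-greedily from the current Q
    a_next = random.choice(actions) if random.random() < epsilon else a_greedy

    delta = r + (0.0 if done else gamma * Q[(x_next, a_greedy)]) - Q[(x, a)]
    E[(x, a)] += 1.0  # accumulating trace for the visited state/action pair

    for key in list(E):
        Q[key] += alpha * delta * E[key]
        if a_next == a_greedy:
            E[key] *= gamma * lam   # on-policy: decay traces as in TD(lambda)
        else:
            E[key] = 0.0            # off-policy action breaks the prediction sequence
    return a_next
```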
9.2.7 Approximate Reinforcement Learning
Analogous to approximate DP, RL can handle large state spaces by approximating the action-value function. Given some estimator Q that approximates the action-value function, this estimator is, as before, to be updated after receiving reward r_t for a transition from x_t to x_{t+1} when performing action a_t. The estimator's target value is Q(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma V(x_{t+1}), where V(x_{t+1}) is the currently best estimate of the value of state x_{t+1}. Thus, at time t, the aim is to find the estimator Q that minimises some distance between itself and all previous target values, which, when assuming the squared distance, results in minimising

    \sum_{m=1}^{t} \left( Q(x_m, a_m) - r_{x_m x_{m+1}}(a_m) - \gamma V_t(x_{m+1}) \right)^2.    (9.18)

As previously shown, Q-Learning uses V_t(x) = \max_a Q_t(x, a), and SARSA(0) relies on V_t(x) = Q_t(x, a), where, in the case of the latter, a is the action performed in state x.
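The two choices of V_t can be made explicit in code. This sketch assumes a generic estimator q with a predict(x, a) method, which is an illustrative interface rather than one defined in the text.

```python
def td_target(q, r, x_next, actions, gamma, a_next=None):
    """Target value r + gamma * V_t(x_next) used in (9.18).

    Q-Learning:  V_t(x) = max_a Q_t(x, a)  -> call with a_next=None
    SARSA(0):    V_t(x) = Q_t(x, a)        -> pass the action actually taken in x_next
    """
    if a_next is None:
        # off-policy (Q-Learning) estimate of the next state's value
        v_next = max(q.predict(x_next, b) for b in actions)
    else:
        # on-policy (SARSA(0)) estimate: value of the action actually performed
        v_next = q.predict(x_next, a_next)
    return r + gamma * v_next
```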
Tabular Q-Learning and SARSA(0) are easily extracted from the above problem formulation by assuming that each state/action pair is estimated separately by Q(x, a) = \theta_{x,a}. Under this assumption, applying the LMS algorithm to minimising (9.18) directly results in (9.16) or (9.17), depending on how V_t is estimated.
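As a brief sanity check of this claim (a sketch, not a derivation given in the text), a single LMS step of size \alpha_t on the most recent summand of (9.18) under the tabular parametrisation Q(x, a) = \theta_{x,a} reads

    \theta_{x_t, a_t} \leftarrow \theta_{x_t, a_t} + \alpha_t \left( r_{x_t x_{t+1}}(a_t) + \gamma V_t(x_{t+1}) - \theta_{x_t, a_t} \right),

which recovers (9.17) for the Q-Learning choice V_t(x) = \max_a Q_t(x, a), and the SARSA(0) update (9.16) for V_t(x) = Q_t(x, a).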
The next section shows from first principles how the same approach can be applied to performing RL with LCS, that is, when Q is an estimator that is given by an LCS.
9.3 Reinforcement Learning with LCS
Performing RL with LCS means using LCS to approximate the action-value function estimate. RL methods update this estimate incrementally, and we can