For a sequence of states {x_1, x_2, ...} and actions {a_1, a_2, ...}, the Q-values are updated by

Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + α_t ( r_{x_t x_{t+1}}(a_t) + γ max_{a ∈ A} Q_t(x_{t+1}, a) − Q_t(x_t, a_t) ),   (9.17)

where α_t denotes the step size at time t. As before, the explore/exploit dilemma applies when selecting actions based on the current Q.
A variant of Q-Learning, called Q(λ), uses eligibility traces like TD(λ) for as long as on-policy actions are performed [229]. As soon as an off-policy action is chosen, all traces are reset to zero, as the off-policy action breaks the temporal sequence of predictions. Hence, the performance increase due to traces depends significantly on the policy that is used, but is usually marginal. A study by Drugowitsch and Barry [77] showed that, when used in XCS, Q(λ) performs even worse than standard Q-Learning.
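The trace-resetting behaviour described above can be sketched as a single Watkins-style Q(λ) step. This is a minimal illustration, not the exact algorithm of [229]; the accumulating-trace choice and the array layout are assumptions for the sketch.

```python
import numpy as np

def q_lambda_step(Q, E, x, a, r, x_next, a_next, alpha, gamma, lam):
    # One Watkins-style Q(lambda) step: traces E accumulate while the
    # chosen actions stay greedy w.r.t. Q, and are reset to zero as soon
    # as an off-policy (non-greedy) action is taken.
    greedy_next = int(np.argmax(Q[x_next]))
    td_error = r + gamma * np.max(Q[x_next]) - Q[x, a]
    E[x, a] += 1.0                  # accumulating trace for the visited pair
    Q += alpha * td_error * E       # update all pairs with non-zero traces
    if a_next == greedy_next:
        E *= gamma * lam            # on-policy action: decay traces
    else:
        E[:] = 0.0                  # off-policy action: reset all traces
    return Q, E
```

Because any exploratory action wipes the traces, a strongly exploratory policy leaves most updates identical to plain Q-Learning, which matches the "usually marginal" benefit noted above.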
9.2.7 Approximate Reinforcement Learning
Analogous to approximate DP, RL can handle large state spaces by approximating the action-value function. Given some estimator Q that approximates the action-value function, this estimator is, as before, to be updated after receiving reward r_t for a transition from x_t to x_{t+1} when performing action a_t. The estimator's target value is Q(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + γ V(x_{t+1}), where V(x_{t+1}) is the currently best estimate of the value of state x_{t+1}. Thus, at time t, the aim is to find the estimator Q that minimises some distance between itself and all previous target values, which, when assuming the squared distance, results in minimising
Σ_{m=1}^{t} ( Q(x_m, a_m) − ( r_{x_m x_{m+1}}(a_m) + γ V_t(x_{m+1}) ) )^2 .   (9.18)
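As an illustration, objective (9.18) can be evaluated for any estimator over a recorded trajectory. The function below is a sketch under the assumption that the estimator Q and the value estimate V_t are supplied as plain callables, and that the trajectory is stored as (x_m, a_m, r_m, x_{m+1}) tuples.

```python
def squared_error_918(Q, V, transitions, gamma):
    # Evaluates (9.18): sum over all observed transitions of the squared
    # distance between the estimate Q(x_m, a_m) and its bootstrapped
    # target r_{x_m x_{m+1}}(a_m) + gamma * V_t(x_{m+1}).
    return sum((Q(x, a) - (r + gamma * V(x_next))) ** 2
               for (x, a, r, x_next) in transitions)
```

Note that the targets themselves depend on V_t and thus change as learning proceeds, so this is a moving-target least-squares problem rather than ordinary regression.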
As previously shown, Q-Learning uses V_t(x) = max_a Q_t(x, a), and SARSA(0) relies on V_t(x) = Q_t(x, a), where, in case of the latter, a is the action performed in state x.
Tabular Q-Learning and SARSA(0) are easily extracted from the above problem formulation by assuming that each state/action pair is estimated separately by Q(x, a) = θ_{x,a}. Under this assumption, applying the LMS algorithm to minimising (9.18) directly results in (9.16) or (9.17), depending on how V_t is estimated.
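This equivalence is easy to see in code: with the tabular parameterisation Q(x, a) = θ_{x,a}, a single LMS step towards the bootstrapped target is exactly the tabular update, and the choice of target distinguishes Q-Learning from SARSA(0). The function names below are illustrative, not from the source.

```python
import numpy as np

def lms_step(theta, x, a, target, alpha):
    # LMS gradient step on one parameter of Q(x, a) = theta[x, a];
    # since each state/action pair has its own parameter, only the
    # visited entry moves towards the target.
    theta[x, a] += alpha * (target - theta[x, a])
    return theta

def q_learning_target(theta, r, x_next, gamma):
    # V_t(x) = max_a Q_t(x, a) yields the Q-Learning target of (9.17).
    return r + gamma * np.max(theta[x_next])

def sarsa_target(theta, r, x_next, a_next, gamma):
    # V_t(x) = Q_t(x, a) with the performed action yields the SARSA(0) target.
    return r + gamma * theta[x_next, a_next]
```

Substituting `q_learning_target` into `lms_step` reproduces update (9.17) term by term, with `alpha` playing the role of the step size α_t.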
The next section shows from first principles how the same approach can be applied to performing RL with LCS, that is, when Q is an estimator that is given by an LCS.
9.3 Reinforcement Learning with LCS
Performing RL with LCS means using LCS to approximate the action-value function estimate. RL methods update this estimate incrementally, and we can