transitions. In SARSA(0), the actually observed transition replaces the potential transitions, such that the target value of the estimate $Q(x_t, a_t)$ becomes $\hat{Q}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1})$. Note that the value of the next state is approximated by the current action-value function estimate $Q_t$ and the assumption that the current policy is followed when choosing the action in the next state, such that $V(x_{t+1}) \approx Q_t(x_{t+1}, a_{t+1})$.
Using $Q_{t+1}(x_t, a_t) = \hat{Q}(x_t, a_t)$ would not lead to good results, as it makes the update highly dependent on the quality of the policy that is used to select $a_t$. Instead, the LMS algorithm (see Sect. 5.3.3) is used to minimise the squared difference between the estimate $Q_{t+1}$ and its target $\hat{Q}$, such that the action-value function estimate is updated by

$$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t \left( r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1}) - Q_t(x_t, a_t) \right), \qquad (9.16)$$
where $\alpha_t$ denotes the step size of the LMS algorithm at time $t$. For all state/action pairs $(x, a) \neq (x_t, a_t)$, the action-value function estimates remain unchanged, that is, $Q_{t+1}(x, a) = Q_t(x, a)$.
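The update (9.16) amounts to a single tabular assignment per observed transition. The following Python sketch illustrates it for a small discrete state and action space; the array sizes, reward argument, and discount factor are illustrative assumptions and not taken from the text.

```python
import numpy as np

# Illustrative sizes; any finite state/action space works the same way.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # tabular action-value estimate Q_t

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma=0.9):
    """One SARSA(0) step, Eq. (9.16): move Q_t(x_t, a_t) towards the
    observed-transition target r + gamma * Q_t(x_{t+1}, a_{t+1})."""
    target = r + gamma * Q[x_next, a_next]
    Q[x, a] += alpha * (target - Q[x, a])
    # all other entries of Q are left unchanged
```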
The actions can be chosen according to the current action-value function estimate, such that $a_t = \arg\max_a Q_t(x_t, a)$. This causes SARSA(0) to always perform the action that is assumed to be the reward-maximising one according to the current estimate. Always following such a policy is not advisable, as it could cause the method to get stuck in a local optimum by not sufficiently exploring the whole state space. Thus, a good balance between exploiting the current knowledge and exploring the state space by performing seemingly sub-optimal actions is required. This explore/exploit dilemma is fundamental to RL methods and will hence be discussed in more detail in a later section. For now let us just note that the update of $Q$ is based on the state trajectory of the current policy, even when sub-optimal actions are chosen, such that SARSA is called an on-policy method.
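One common way of striking such a balance, used here purely as an illustration rather than as the scheme discussed later, is an $\epsilon$-greedy policy: with probability $1 - \epsilon$ the greedy action is taken, and otherwise an action is drawn uniformly at random.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon=0.1, rng=None):
    """Greedy action w.r.t. Q[x] with probability 1 - epsilon,
    uniformly random action otherwise (illustrative example)."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[x]))               # exploit
```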
SARSA($\lambda$) for $\lambda > 0$ relies on the operator $T_\mu^{(\lambda)}$ rather than $T_\mu$. A detailed discussion of the consequences of this change is beyond the scope of this topic, but more details are given by Sutton [207] and Sutton and Barto [209].
9.2.6 Q-Learning
The much-celebrated Q-Learning was developed by Watkins [228] as a result of combining TD-learning and DP methods. It is similar to SARSA(0), but rather than using $\hat{Q}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1})$ as the target value for $Q(x_t, a_t)$, it uses $\hat{Q}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma \max_a Q_t(x_{t+1}, a)$, and thus approximates value iteration rather than policy iteration. SARSA(0) and Q-Learning are equivalent if both always follow the greedy action $a_t = \arg\max_a Q_t(x_t, a)$, but this would ignore the explore/exploit dilemma. Q-Learning is called an off-policy method as the value function estimates $V(x_{t+1}) \approx \max_a Q_t(x_{t+1}, a)$ are independent of the actions that are actually performed.
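To make the difference to SARSA(0) concrete, the following sketch mirrors the earlier update but maximises over the next action instead of using the action the policy actually selects; the same illustrative assumptions (tabular Q array, NumPy) apply.

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha, gamma=0.9):
    """One Q-Learning step: off-policy target r + gamma * max_a Q_t(x_{t+1}, a)."""
    target = r + gamma * np.max(Q[x_next])  # best next action, not the one taken
    Q[x, a] += alpha * (target - Q[x, a])
```

Because the target maximises over all actions, the update does not depend on how the next action is chosen, which is exactly what makes the method off-policy.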