transitions. In SARSA(0), the actually observed transition replaces the potential transitions, such that the target value of the estimate $Q(x_t, a_t)$ becomes $Q(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1})$. Note that the value of the next state is approximated by the current action-value function estimate $Q_t$ under the assumption that the current policy is followed when choosing the action in the next state, such that $V(x_{t+1}) \approx Q_t(x_{t+1}, a_{t+1})$.
Using $Q_{t+1}(x_t, a_t) = Q(x_t, a_t)$ would not lead to good results, as it makes the update highly dependent on the quality of the policy that is used to select $a_t$. Instead, the LMS algorithm (see Sect. 5.3.3) is used to minimise the squared difference between the estimate $Q_{t+1}$ and its target $Q$, such that the action-value function estimate is updated by
$$
Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t \left( r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1}) - Q_t(x_t, a_t) \right),
\qquad (9.16)
$$
where $\alpha_t$ denotes the step size of the LMS algorithm at time $t$. For all state/action pairs $x \neq x_t$, $a \neq a_t$, the action-value function estimates remain unchanged, that is, $Q_{t+1}(x, a) = Q_t(x, a)$.
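As an illustration only, the following sketch applies the update (9.16) to a tabular action-value estimate stored as a 2-D array indexed by state and action; the function name, array layout, and parameter names are assumptions for this sketch, not notation from the text.

```python
import numpy as np

def sarsa_update(Q, x_t, a_t, r, x_next, a_next, alpha, gamma):
    """One tabular SARSA(0) step as in Eq. (9.16)."""
    target = r + gamma * Q[x_next, a_next]          # bootstrapped target value
    Q[x_t, a_t] += alpha * (target - Q[x_t, a_t])   # LMS step towards the target
    # All other entries of Q remain unchanged.

# Example: 10 states, 4 actions, all estimates initialised to zero.
Q = np.zeros((10, 4))
sarsa_update(Q, x_t=3, a_t=1, r=0.5, x_next=4, a_next=2, alpha=0.1, gamma=0.9)
```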
The actions can be chosen according to the current action-value function
estimate, such that $a_t = \arg\max_a Q_t(x_t, a)$. This causes SARSA(0) to always
perform the action that is assumed to be the reward-maximising one according
to the current estimate. Always following such a policy is not advisable, as it
could cause the method to get stuck in a local optimum by not sufficiently
exploring the whole state space. Thus, a good balance between exploiting the
current knowledge and exploring the state space by performing seemingly sub-
optimal actions is required. This explore/exploit dilemma is fundamental to RL
methods and will hence be discussed in more detail in a later section. For now let
us just note that the update of Q is based on the state trajectory of the current
policy, even when sub-optimal actions are chosen, such that SARSA is called an
on-policy method.
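One common way of striking this balance, shown here purely as a hedged sketch and not necessarily the scheme discussed later in the text, is an $\epsilon$-greedy choice that follows the greedy action most of the time and otherwise explores at random; the function name and parameters are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon, rng):
    """Return the greedy action for state x with probability 1 - epsilon,
    otherwise a uniformly random action (one of many possible schemes)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[x]))                # exploit: current best estimate

rng = np.random.default_rng(0)
a = epsilon_greedy(np.zeros((10, 4)), x=3, epsilon=0.1, rng=rng)
```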
SARSA($\lambda$) for $\lambda > 0$ relies on the operator $T_\mu^{(\lambda)}$ rather than $T_\mu$. A detailed
discussion of the consequences of this change is beyond the scope of this topic,
but more details are given by Sutton [207] and Sutton and Barto [209].
9.2.6 Q-Learning
The much-celebrated Q-Learning was developed by Watkins [228] as a result of combining TD-learning and DP methods. It is similar to SARSA(0), but rather than using $Q(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1})$ as the target value for $Q(x_t, a_t)$, it uses $Q(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma \max_a Q_t(x_{t+1}, a)$, and thus approximates value iteration rather than policy iteration. SARSA(0) and Q-Learning are equivalent if both always follow the greedy action $a_t = \arg\max_a Q_t(x_t, a)$, but this would ignore the explore/exploit dilemma. Q-Learning is called an off-policy method as the value function estimates $V(x_{t+1}) \approx \max_a Q_t(x_{t+1}, a)$ are independent of the actions that are actually performed.
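For comparison with the SARSA(0) sketch above, a tabular Q-Learning step differs only in its bootstrap term, which maximises over actions instead of using the action selected by the current policy; again the names and array layout are assumptions for illustration.

```python
import numpy as np

def q_learning_update(Q, x_t, a_t, r, x_next, alpha, gamma):
    """One tabular Q-Learning step: the bootstrap term maximises over actions
    and is therefore independent of the action actually taken next."""
    target = r + gamma * np.max(Q[x_next])          # greedy bootstrap
    Q[x_t, a_t] += alpha * (target - Q[x_t, a_t])   # same LMS step as SARSA(0)
```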
 