transitions. In SARSA(0), the actually observed transition replaces the potential transitions, such that the target value of the estimate $Q(x_t, a_t)$ becomes $\hat{Q}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1})$. Note that the value of the next state is approximated by the current action-value function estimate $Q_t$ and the assumption that the current policy is followed when choosing the action in the next state, such that $V(x_{t+1}) \approx Q_t(x_{t+1}, a_{t+1})$.
Using $Q_{t+1}(x_t, a_t) = \hat{Q}(x_t, a_t)$ would not lead to good results, as it makes the update highly dependent on the quality of the policy that is used to select $a_t$. Instead, the LMS algorithm (see Sect. 5.3.3) is used to minimise the squared difference between the estimate $Q_{t+1}$ and its target $\hat{Q}$, such that the action-value function estimate is updated by

$$Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t \left( r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1}) - Q_t(x_t, a_t) \right), \qquad (9.16)$$
where $\alpha_t$ denotes the step size of the LMS algorithm at time $t$. For all state/action pairs $(x, a) \neq (x_t, a_t)$, the action-value function estimates remain unchanged, that is, $Q_{t+1}(x, a) = Q_t(x, a)$.
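The update (9.16) amounts to a single tabular assignment per observed transition. The following Python sketch illustrates it for a small discrete state and action space; the array sizes, reward argument, and discount factor are illustrative assumptions and not taken from the text.

```python
import numpy as np

# Illustrative sizes; any finite state/action space works the same way.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # tabular action-value estimate Q_t

def sarsa_update(Q, x, a, r, x_next, a_next, alpha, gamma=0.9):
    """One SARSA(0) step, Eq. (9.16): move Q_t(x_t, a_t) towards the
    observed-transition target r + gamma * Q_t(x_{t+1}, a_{t+1})."""
    target = r + gamma * Q[x_next, a_next]
    Q[x, a] += alpha * (target - Q[x, a])
    # all other entries of Q are left unchanged
```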
The actions can be chosen according to the current action-value function estimate, such that $a_t = \arg\max_a Q_t(x_t, a)$. This causes SARSA(0) to always perform the action that is assumed to be the reward-maximising one according to the current estimate. Always following such a policy is not advisable, as it could cause the method to get stuck in a local optimum by not sufficiently exploring the whole state space. Thus, a good balance between exploiting the current knowledge and exploring the state space by performing seemingly sub-optimal actions is required. This explore/exploit dilemma is fundamental to RL methods and will hence be discussed in more detail in a later section. For now let us just note that the update of $Q$ is based on the state trajectory of the current policy, even when sub-optimal actions are chosen, such that SARSA is called an on-policy method.
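One common way of striking such a balance, used here purely as an illustration rather than as the scheme discussed later, is an $\epsilon$-greedy policy: with probability $1 - \epsilon$ the greedy action is taken, and otherwise an action is drawn uniformly at random.

```python
import numpy as np

def epsilon_greedy(Q, x, epsilon=0.1, rng=None):
    """Greedy action w.r.t. Q[x] with probability 1 - epsilon,
    uniformly random action otherwise (illustrative example)."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[x]))               # exploit
```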
SARSA($\lambda$) for $\lambda > 0$ relies on the operator $T_\mu^{(\lambda)}$ rather than $T_\mu$. A detailed discussion of the consequences of this change is beyond the scope of this topic, but more details are given by Sutton [207] and Sutton and Barto [209].
9.2.6 Q-Learning
The much-celebrated Q-Learning was developed by Watkins [228] as a result of combining TD-learning and DP methods. It is similar to SARSA(0), but rather than using $\hat{Q}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma Q_t(x_{t+1}, a_{t+1})$ as the target value for $Q(x_t, a_t)$, it uses $\hat{Q}(x_t, a_t) = r_{x_t x_{t+1}}(a_t) + \gamma \max_a Q_t(x_{t+1}, a)$, and thus approximates value iteration rather than policy iteration. SARSA(0) and Q-Learning are equivalent if both always follow the greedy action $a_t = \arg\max_a Q_t(x_t, a)$, but this would ignore the explore/exploit dilemma. Q-Learning is called an off-policy method as the value function estimates $V(x_{t+1}) \approx \max_a Q_t(x_{t+1}, a)$ are independent of the actions that are actually performed.
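To make the difference to SARSA(0) concrete, the following sketch mirrors the earlier update but maximises over the next action instead of using the action the policy actually selects; the same illustrative assumptions (tabular Q array, NumPy) apply.

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha, gamma=0.9):
    """One Q-Learning step: off-policy target r + gamma * max_a Q_t(x_{t+1}, a)."""
    target = r + gamma * np.max(Q[x_next])  # best next action, not the one taken
    Q[x, a] += alpha * (target - Q[x, a])
```

Because the target maximises over all actions, the update does not depend on how the next action is chosen, which is exactly what makes the method off-policy.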