Fig. 9.1. The variables of an MDP involved in a single transition from state $x_t$ to state $x_{t+1}$ after the agent performed action $a_t$ and received reward $r_t$.
rewards. A possible solution is represented by a policy $\mu : \mathcal{X} \to \mathcal{A}$, which returns the chosen action $a = \mu(x)$ for any state $x$. With a fixed policy $\mu$, the MDP is reduced to a Markov chain with transition probabilities $p_\mu(x_j \mid x_i) = p(x_j \mid x_i, a = \mu(x_i))$, and rewards $r_{x_i x_j} = r_{x_i x_j}(\mu(x_i))$. In such cases it is common to operate with the expected reward $r_{x_i} = \sum_{x_j \in \mathcal{X}} p_\mu(x_j \mid x_i)\, r_{x_i x_j}$. This reward is the one expected to be received in state $x_i$ when actions are chosen according to policy $\mu$.
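As a concrete illustration, the following Python sketch (our own construction, with hypothetical toy matrices rather than anything from the text) derives the induced transition matrix $p_\mu$ and the expected rewards $r_{x_i}$ for a fixed policy:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; the numbers are illustrative only.
# p[a, i, j] = p(x_j | x_i, a) and r[a, i, j] = r_{x_i x_j}(a).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
mu = np.array([0, 1])          # fixed policy: take action mu[i] in state x_i

states = np.arange(p.shape[1])
p_mu = p[mu, states, :]        # p_mu(x_j | x_i) = p(x_j | x_i, mu(x_i))
r_mu = r[mu, states, :]        # r_{x_i x_j} = r_{x_i x_j}(mu(x_i))

# Expected reward per state: r_{x_i} = sum_j p_mu(x_j | x_i) r_{x_i x_j}
r_exp = (p_mu * r_mu).sum(axis=1)
```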
9.1.2 The Value Function, the Action-Value Function and Bellman's Equation
The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function $V : \mathcal{X} \to \mathbb{R}$ that expresses for each state how much reward we can expect to receive in the long run. While $V$ was previously used to denote the mixing weight vectors, those will not be referred to in this chapter, and hence any ambiguity is avoided. Let $\mu = \{\mu_0, \mu_1, \ldots\}$ be a sequence of policies where we use policy $\mu_t$ at time $t$, starting at time $t = 0$. Then, the reward that is accumulated after $n$ steps when starting at state $x$, called the n-step return $V_n^\mu$ for state $x$, is given by

$$V_n^\mu(x) = \mathbb{E}\left( \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} + \gamma^n R(x_n) \;\middle|\; x_0 = x \right), \qquad (9.1)$$
where $\{x_0, x_1, \ldots\}$ is the sequence of states, and $R(x_n)$ denotes the expected return that will be received when starting from state $x_n$. The return differs from the reward in that it implicitly considers future reward.
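To make (9.1) concrete, here is a minimal Monte Carlo sketch (our own, reusing the hypothetical arrays from the previous snippet) that estimates the n-step return for a single stationary policy, under the simplifying assumption that the terminal term $R(x_n)$ is zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def n_step_return(x0, mu, p, r, gamma, n, episodes=10_000):
    """Monte Carlo estimate of V_n(x0) per (9.1) for a stationary
    policy mu, assuming R(x_n) = 0 (a simplification of ours)."""
    total = 0.0
    for _ in range(episodes):
        x, ret = x0, 0.0
        for t in range(n):
            a = mu[x]
            x_next = rng.choice(p.shape[2], p=p[a, x])  # sample x_{t+1}
            ret += gamma**t * r[a, x, x_next]           # discounted reward
            x = x_next
        total += ret
    return total / episodes
```

For example, `n_step_return(0, mu, p, r, gamma=0.9, n=5)` estimates the 5-step return when starting in the first state.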
In finite horizon cases, where $n < \infty$, the optimal policy $\mu^*$ is the one that maximises the expected return for each state $x \in \mathcal{X}$, giving the optimal n-step return $V_n^*(x) = \max_\mu V_n^\mu(x)$. Finite horizon cases can be seen as a special case of infinite horizon cases with zero-reward absorbing states [17]. For infinite horizon cases, the expected return when starting at state $x$ is analogously to (9.1) given by
$$V^\mu(x) = \lim_{n \to \infty} \mathbb{E}\left( \sum_{t=0}^{n-1} \gamma^t r^{\mu_t}_{x_t x_{t+1}} \;\middle|\; x_0 = x \right). \qquad (9.2)$$
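For a single stationary policy, $V^\mu$ of (9.2) satisfies the fixed-point relation $V^\mu(x_i) = r_{x_i} + \gamma \sum_{x_j} p_\mu(x_j \mid x_i)\, V^\mu(x_j)$, which suggests the following iterative policy-evaluation sketch (again our own construction, reusing `p_mu` and `r_exp` from the first snippet):

```python
import numpy as np

def evaluate_policy(p_mu, r_exp, gamma=0.9, tol=1e-8):
    """Approximate V^mu from (9.2) by iterating the fixed-point
    update V <- r_exp + gamma * P_mu V until convergence."""
    v = np.zeros(len(r_exp))
    while True:
        v_new = r_exp + gamma * p_mu @ v   # one Bellman backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

Since $\gamma < 1$ makes this update a contraction, the iteration converges to the unique fixed point $V^\mu$.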