stability of RL when used with LCS. Learning of long action sequences is another issue that XCS is known to struggle with [11], and a previously proposed solution [12] only applies to certain problem types. Whether the introduced optimality criterion provides a potential solution is still an open question, but the outlook is good, as will be discussed before providing an LCS-related overview of the exploration/exploitation dilemma. But first, let us define sequential decision tasks more formally in Sect. 9.1, and introduce DP and RL methods that provide solutions to such tasks in Sect. 9.2.
9.1 Problem Definition
The sequential decision tasks that will be considered are the ones describable by
a Markov Decision Process (MDP) (see Sect. 2.1). To stay close to the notation
that is common in the literature [17, 209], some of the previously used symbols
will be assigned a new meaning. The definitions given in this section are similar
to the ones used by Bertsekas and Tsitsiklis [17] and Drugowitsch and Barry [79].
9.1.1 Markov Decision Processes
Let X be the set of states x ∈ X of the problem domain, that is assumed to be of finite size N, and hence is mapped into the natural numbers ℕ¹. X was previously defined to be the input space, but as the states are identified by the input that is determined by the environmental state, state and input are used interchangeably. In every state x_i ∈ X, an action a out of a finite set A is performed and causes a state transition to x_j. The probability of getting to state x_j after performing action a in state x_i is given by the transition function p(x_j | x_i, a), which is a probability distribution over X, conditional on X × A. Each such transition is mediated by a scalar reward r_{x_i x_j}(a), defined by the reward function r : X × X × A → ℝ. The positive discount factor γ ∈ ℝ with 0 < γ ≤ 1 determines the preference of immediate reward over future reward. Therefore, the MDP that describes the problem is defined by the quintuple {X, A, p, r, γ}². The involved variables are illustrated in Fig. 2.1(b), which is reproduced in Fig. 9.1 for convenience. Previously, γ denoted the step size for gradient-based incremental methods in Chap. 5; in this chapter, the step size will be denoted by α to conform to the RL literature [209].
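To make the quintuple {X, A, p, r, γ} concrete, the sketch below encodes a small finite MDP as plain arrays. The state count, transition probabilities and rewards are purely illustrative assumptions and are not taken from the text.

```python
# A minimal sketch of the MDP quintuple {X, A, p, r, gamma} for an assumed toy
# problem with N = 3 states and 2 actions; all numbers are illustrative only.
import numpy as np

N, num_actions = 3, 2          # finite state set X = {0, 1, 2}, action set A = {0, 1}
gamma = 0.9                    # discount factor, 0 < gamma <= 1

# Transition function p(x_j | x_i, a): one probability distribution over next
# states for every (state, action) pair, stored as p[a, i, j].
p = np.zeros((num_actions, N, N))
p[0] = [[0.8, 0.2, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.2, 0.8]]
p[1] = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
assert np.allclose(p.sum(axis=2), 1.0)   # each p(. | x_i, a) sums to 1

# Reward function r : X x X x A -> R, stored as r[a, i, j] = r_{x_i x_j}(a).
r = np.zeros((num_actions, N, N))
r[1, 1, 2] = 1.0               # e.g. reward 1 for reaching state 2 from state 1 via action 1

mdp = {"X": range(N), "A": range(num_actions), "p": p, "r": r, "gamma": gamma}
```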
The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.
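As a purely illustrative aside, the discounted return that formalises this preference can be computed directly; the reward sequence below is an assumed toy example, not one from the text.

```python
# Discounted return sum_t gamma^t * r_t for a hypothetical reward sequence,
# showing how gamma < 1 weights future rewards less than immediate ones.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]             # assumed reward sequence r_0, r_1, ...
ret = sum(gamma**t * r_t for t, r_t in enumerate(rewards))
print(ret)                                  # 0.9**2 + 0.9**3 = 0.81 + 0.729 = 1.539
```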
1 Assuming a finite state space simplifies the presentation. Extending it to a continuous state space requires considerably more technical work. For examples of an analysis of reinforcement learning in continuous state spaces see Konda and Tsitsiklis [129] and Ormoneit and Sen [177].
2 The problem definition, and with it the solution to the problem, changes when the discount rate γ is changed. Thus, it is important to consider the discount rate γ as part of the problem rather than as a tunable parameter. This fact is ignored in some LCS research, where the discount rate is modified to make the task seemingly easier to learn, when, in fact, the task itself is changed.
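The point of footnote 2 can be illustrated with a hypothetical two-choice task (an assumption of this sketch, not an example from the book): the optimal action flips as γ changes, so a different γ defines a different problem.

```python
# Illustrative only: a toy task where the optimal action depends on gamma,
# so changing gamma changes the task itself.
def returns(gamma):
    # Action A: immediate reward 2, then the episode ends.
    value_a = 2.0
    # Action B: reward 0 now, reward 3 one step later, then the episode ends.
    value_b = 0.0 + gamma * 3.0
    return value_a, value_b

for gamma in (0.5, 0.9):
    value_a, value_b = returns(gamma)
    best = "A" if value_a >= value_b else "B"
    print(f"gamma={gamma}: V(A)={value_a}, V(B)={value_b}, optimal action = {best}")
# gamma=0.5: V(A)=2.0, V(B)=1.5 -> A is optimal
# gamma=0.9: V(A)=2.0, V(B)=2.7 -> B is optimal
```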