stability of RL when used with LCS. Learning of long action sequences is another issue that XCS is known to struggle with [11], and a previously proposed solution [12] only applies to certain problem types. Whether the introduced optimality criterion provides a potential solution is still an open question, but the outlook is good, as will be discussed before providing an LCS-related overview of the exploration/exploitation dilemma. But first, let us define sequential decision tasks more formally in Sect. 9.1, and introduce DP and RL methods that provide solutions to such tasks in Sect. 9.2.
9.1 Problem Definition
The sequential decision tasks that will be considered are the ones describable by
a Markov Decision Process (MDP) (see Sect. 2.1). To stay close to the notation
that is common in the literature [17, 209], some of the previously used symbols
will be assigned a new meaning. The definitions given in this section are similar
to the ones used by Bertsekas and Tsitsiklis [17] and Drugowitsch and Barry [79].
9.1.1 Markov Decision Processes
Let X be the set of states x ∈ X of the problem domain, that is assumed to be of finite size N, and hence is mapped into the natural numbers ℕ¹. X was previously defined to be the input space, but as the states are identified by the input that is determined by the environmental state, state and input are used interchangeably. In every state x_i ∈ X, an action a out of a finite set A is performed and causes a state transition to x_j. The probability of getting to state x_j after performing action a in state x_i is given by the transition function p(x_j | x_i, a), which is a probability distribution over X, conditional on X × A. Each such transition is mediated by a scalar reward r_{x_i x_j}(a), defined by the reward function r : X × X × A → ℝ. The positive discount factor γ ∈ ℝ with 0 < γ ≤ 1 determines the preference of immediate reward over future reward. Therefore, the MDP that describes the problem is defined by the quintuple {X, A, p, r, γ}². The involved variables are illustrated in Fig. 2.1(b), which is reproduced in Fig. 9.1 for convenience. Previously, γ denoted the step size for gradient-based incremental methods in Chap. 5; in this chapter, the step size will be denoted by α to conform to the RL literature [209].
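To make the quintuple {X, A, p, r, γ} concrete, the sketch below encodes a small finite MDP as plain arrays. The state count, transition probabilities and rewards are purely illustrative assumptions and are not taken from the text.

```python
# A minimal sketch of the MDP quintuple {X, A, p, r, gamma} for an assumed toy
# problem with N = 3 states and 2 actions; all numbers are illustrative only.
import numpy as np

N, num_actions = 3, 2          # finite state set X = {0, 1, 2}, action set A = {0, 1}
gamma = 0.9                    # discount factor, 0 < gamma <= 1

# Transition function p(x_j | x_i, a): one probability distribution over next
# states for every (state, action) pair, stored as p[a, i, j].
p = np.zeros((num_actions, N, N))
p[0] = [[0.8, 0.2, 0.0],
        [0.1, 0.8, 0.1],
        [0.0, 0.2, 0.8]]
p[1] = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
assert np.allclose(p.sum(axis=2), 1.0)   # each p(. | x_i, a) sums to 1

# Reward function r : X x X x A -> R, stored as r[a, i, j] = r_{x_i x_j}(a).
r = np.zeros((num_actions, N, N))
r[1, 1, 2] = 1.0               # e.g. reward 1 for reaching state 2 from state 1 via action 1

mdp = {"X": range(N), "A": range(num_actions), "p": p, "r": r, "gamma": gamma}
```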
The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.
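As a purely illustrative aside, the discounted return that formalises this preference can be computed directly; the reward sequence below is an assumed toy example, not one from the text.

```python
# Discounted return sum_t gamma^t * r_t for a hypothetical reward sequence,
# showing how gamma < 1 weights future rewards less than immediate ones.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]             # assumed reward sequence r_0, r_1, ...
ret = sum(gamma**t * r_t for t, r_t in enumerate(rewards))
print(ret)                                  # 0.9**2 + 0.9**3 = 0.81 + 0.729 = 1.539
```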
1 Assuming a finite state space simplifies the presentation. Extending it to a continuous state space requires considerably more technical work. For examples of an analysis of reinforcement learning in continuous state spaces see Konda and Tsitsiklis [129] and Ormoneit and Sen [177].
2 The problem definition, and with it the solution to the problem, changes when the discount rate γ is changed. Thus, it is important to consider the discount rate γ as part of the problem rather than as a tunable parameter. This fact is ignored in some LCS research, where the discount rate is modified to make the task seemingly easier to learn, when, in fact, the task itself is changed.
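The point of footnote 2 can be illustrated with a hypothetical two-choice task (an assumption of this sketch, not an example from the book): the optimal action flips as γ changes, so a different γ defines a different problem.

```python
# Illustrative only: a toy task where the optimal action depends on gamma,
# so changing gamma changes the task itself.
def returns(gamma):
    # Action A: immediate reward 2, then the episode ends.
    value_a = 2.0
    # Action B: reward 0 now, reward 3 one step later, then the episode ends.
    value_b = 0.0 + gamma * 3.0
    return value_a, value_b

for gamma in (0.5, 0.9):
    value_a, value_b = returns(gamma)
    best = "A" if value_a >= value_b else "B"
    print(f"gamma={gamma}: V(A)={value_a}, V(B)={value_b}, optimal action = {best}")
# gamma=0.5: V(A)=2.0, V(B)=1.5 -> A is optimal
# gamma=0.9: V(A)=2.0, V(B)=2.7 -> B is optimal
```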