might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important, as it has a significant impact on the performance of RL methods.
There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge, or one can take this certainty into account to choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable. All of these variants and their applicability to RL with LCS are discussed below.
Undirected Exploration
A certain degree of exploration can be achieved by selecting a sub-optimal action every now and then. This form of exploration is called undirected, as it does not take into account the certainty about the current value or action-value function estimate. Probably the most popular instances of this exploration type are the ε-greedy policy and softmax action selection.
The greedy policy is the one that chooses the action that is optimal according to the current knowledge at each step, and is thus given by μ(x) = argmax_a Q(x, a). In contrast, the ε-greedy policy selects a random sub-optimal action with probability ε, and the greedy action otherwise. Its stochastic policy is given by
\[
\mu(a \mid x) =
\begin{cases}
1 - \varepsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(x, a'), \\
\varepsilon / (|\mathcal{A}| - 1) & \text{otherwise},
\end{cases}
\qquad (9.51)
\]
where μ(a | x) denotes the probability of choosing action a in state x.
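As a concrete illustration of (9.51), consider the following minimal Python sketch. It assumes NumPy, a vector q holding the current estimates Q(x, a) for all actions in the current state, and an illustrative function name; it is a sketch of the sampling rule only, not of any particular LCS implementation.

    import numpy as np

    def epsilon_greedy_action(q, epsilon, rng):
        """Sample an action from the epsilon-greedy policy of (9.51).

        q       : 1-D array of action-value estimates Q(x, a) for the current state
        epsilon : probability mass spread uniformly over the sub-optimal actions
        rng     : NumPy random Generator
        """
        greedy = int(np.argmax(q))
        if len(q) > 1 and rng.random() < epsilon:
            # pick one of the |A| - 1 non-greedy actions,
            # each with probability epsilon / (|A| - 1)
            others = [a for a in range(len(q)) if a != greedy]
            return int(rng.choice(others))
        return greedy  # greedy action, chosen with probability 1 - epsilon

    rng = np.random.default_rng(0)
    print(epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))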
ε-greedy does not consider the magnitude of the action-value function when choosing the action and thus does not differentiate between actions that are only slightly sub-optimal and ones that are significantly sub-optimal. This is accounted for by softmax action selection, where actions are chosen in proportion to the magnitude of the estimate of their associated expected return. One possible implementation is to sample actions from the Boltzmann distribution, given by
\[
\mu(a \mid x) = \frac{\exp\big(Q(x, a)/T\big)}{\sum_{a' \in \mathcal{A}} \exp\big(Q(x, a')/T\big)},
\qquad (9.52)
\]
where T is the temperature that allows regulating the strength with which the magnitude of the expected return is taken into account. A low temperature T → 0 causes greedy action selection. Raising the temperature T → ∞, on the other hand, makes the stochastic policy choose all actions with equal probability.
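The effect of the temperature can be made concrete with a small sketch, again assuming NumPy and a vector q of action-value estimates for the current state (names are illustrative). Subtracting the maximum before exponentiating leaves the distribution of (9.52) unchanged but avoids numerical overflow at low temperatures.

    import numpy as np

    def boltzmann_policy(q, temperature):
        """Action distribution of (9.52) for one state."""
        z = (q - np.max(q)) / temperature   # shift for numerical stability
        p = np.exp(z)
        return p / p.sum()

    q = np.array([0.1, 0.5, 0.2])
    for T in (0.05, 1.0, 100.0):
        print(T, np.round(boltzmann_policy(q, T), 3))
    # T = 0.05 is close to greedy selection, T = 100 is close to uniform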
XCS(F) also uses undirected exploration, but with neither of the above policies. Instead, it alternates between exploration and exploitation trials, where a single trial is a sequence of transitions until a goal state is reached or the maximum number of steps is exceeded. Exploration trials feature uniform random action selection.
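The trial-alternation scheme itself can be sketched as follows. This is only an illustration of the alternation between uniform random and greedy action selection, not of XCS(F) itself: the environment interface, the q_values function, and all names are hypothetical, and the rule-discovery and parameter-update machinery of XCS(F) is omitted.

    import numpy as np

    def run_trial(env, q_values, explore, max_steps, rng):
        """One trial: uniform random actions if exploring, greedy actions otherwise."""
        x = env.reset()
        for _ in range(max_steps):
            q = q_values(x)  # estimates Q(x, .) for the current state
            a = int(rng.integers(len(q))) if explore else int(np.argmax(q))
            x, _reward, done = env.step(a)
            if done:         # goal state reached
                break

    def alternate_trials(env, q_values, n_trials, max_steps, rng):
        # even-numbered trials explore, odd-numbered trials exploit
        for i in range(n_trials):
            run_trial(env, q_values, explore=(i % 2 == 0),
                      max_steps=max_steps, rng=rng)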