selection, without considering the action-value estimates. In exploitation trials,
on the other hand, all actions are chosen greedily. This causes significant exploration
of the search space, which facilitates learning of the optimal policy. A drawback of
this approach is that, on average, the received reward is lower than if a more
reward-oriented policy, like ε-greedy or softmax action selection, is used. In any
case, undirected policies should only be used if directed exploration is too costly
to implement.
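For reference, the following is a minimal sketch of the two reward-oriented policies named above, ε-greedy and softmax action selection, operating on an array of action-value estimates; the function names and parameter values are illustrative only.

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(q_values, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise act greedily."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))

    def softmax_action(q_values, temperature=1.0):
        """Sample an action with probability proportional to exp(Q(a) / temperature)."""
        prefs = np.asarray(q_values, dtype=float) / temperature
        prefs -= prefs.max()                        # subtract max for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(q_values), p=probs))

Lowering ε or the temperature shifts both policies from exploration towards exploitation, which is why they typically collect more reward than a purely random exploration policy.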
Directed Exploration
Directed exploration is significantly more efficient than undirected exploration
because it takes into account the uncertainty of the action-value function estimate.
This allows it to perform sub-optimal actions in order to reduce this uncertainty,
until it is certain that no further exploration is required to find the optimal policy.
The result is less, but more intelligent, exploration.
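One common way of directing exploration by the uncertainty of the estimates, shown here purely as an illustration and not as any of the specific methods cited below, is to add a confidence-interval bonus to each action value in the style of UCB; the sketch assumes that per-action visit counts are maintained alongside the estimates.

    import numpy as np

    def ucb_action(q_values, action_counts, total_count, c=2.0):
        """Pick the action maximising Q(a) plus an uncertainty bonus.

        Rarely tried actions receive a large bonus, so they are explored
        until their value estimate is sufficiently certain.
        """
        q_values = np.asarray(q_values, dtype=float)
        counts = np.asarray(action_counts, dtype=float)
        # Untried actions get an infinite bonus and are therefore chosen first.
        bonus = np.where(counts > 0,
                         c * np.sqrt(np.log(max(total_count, 1)) / np.maximum(counts, 1)),
                         np.inf)
        return int(np.argmax(q_values + bonus))
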
This strategy is implemented by several methods, both model-based and
model-free (for example, [126, 28, 204]). In fact, some of them have been shown
to be very efficient in the Probably Approximately Correct (PAC) sense (for example,
[204]). These, however, require a model of the reward and transition function,
and thus they have a larger space complexity than model-free RL methods [205].
Nonetheless, methods that perform intelligent exploration currently outperform
all other RL methods [149]. Recently, a model-free method with intelligent
exploration has also become available [205], but according to Littman [149], it
performs "really slow" when compared to model-based alternatives. None of these
methods will be discussed in detail, but they all share the same concept of performing
actions such that the certainty of the action-value function estimate is increased.
A recent LCS approach aimed at providing such intelligent exploration [166],
but without considering the RL techniques that are available. These techniques
could be used in LCS even without having models of the transition and reward
function, by proceeding in a manner similar to [69] and building a model at the
same time as using it to learn the optimal policy. Anticipatory LCS [202, 48, 40,
91, 90] are already used to model at least the transition function, and can easily
be modified to include the reward function. An LCS that does both has already
been developed to perform model-based RL [92, 89], but as it uses heuristics
rather than evolutionary computation techniques for model structure search,
some LCS workers did not consider it to be an LCS. In any case, having
such a model opens the path towards using new exploration control methods to
improve their efficiency and performance.
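To make the idea of building a model and simultaneously using it to learn the optimal policy concrete, the following is a minimal tabular sketch in the spirit of Dyna-style planning; the environment interface (reset, step, actions) is a hypothetical stand-in, and this is not the LCS method cited above.

    import random
    from collections import defaultdict

    def dyna_q(env, episodes=100, alpha=0.1, gamma=0.95, epsilon=0.1, planning_steps=10):
        """Tabular Dyna-style sketch: update Q from real transitions while also
        building a model of the transition and reward function and replaying it.

        `env` is assumed to expose reset() -> state,
        step(action) -> (next_state, reward, done), and a list env.actions.
        """
        Q = defaultdict(float)      # Q[(state, action)] action-value estimates
        model = {}                  # model[(state, action)] = (next_state, reward, done)

        def greedy_value(s):
            return max(Q[(s, a)] for a in env.actions)

        def act(s):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = act(s)
                s2, r, done = env.step(a)
                # Direct RL update from the real transition.
                target = r if done else r + gamma * greedy_value(s2)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                # Record the transition and reward in the learned model.
                model[(s, a)] = (s2, r, done)
                # Planning: additional updates from transitions replayed out of the model.
                for _ in range(planning_steps):
                    (ps, pa), (ps2, pr, pdone) = random.choice(list(model.items()))
                    ptarget = pr if pdone else pr + gamma * greedy_value(ps2)
                    Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
                s = s2
        return Q
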
Bayesian Reinforcement Learning
The intelligent exploration methods discussed above either consider the certainty
of the estimate only implicitly, or maintain it by some form of confidence interval.
A more elegant approach is to employ Bayesian statistics and maintain
complete distributions over each of the estimates.
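As a simple illustration of this idea, and not one of the methods cited in this section, the sketch below maintains an independent Normal posterior over each action's value, assuming Gaussian rewards with known variance, and selects actions by Thompson sampling, that is, by acting greedily on a single draw from those posteriors.

    import numpy as np

    class BayesianActionValues:
        """Independent Normal posterior over each action's value, assuming
        Gaussian rewards with known variance; actions chosen by Thompson sampling."""

        def __init__(self, n_actions, prior_mean=0.0, prior_var=1.0, reward_var=1.0):
            self.mean = np.full(n_actions, prior_mean, dtype=float)
            self.var = np.full(n_actions, prior_var, dtype=float)
            self.reward_var = reward_var
            self.rng = np.random.default_rng()

        def select_action(self):
            # Sample one value per action from its posterior; act greedily on the sample.
            samples = self.rng.normal(self.mean, np.sqrt(self.var))
            return int(np.argmax(samples))

        def update(self, action, reward):
            # Standard Normal-Normal conjugate update of posterior mean and variance.
            precision = 1.0 / self.var[action] + 1.0 / self.reward_var
            self.mean[action] = (self.mean[action] / self.var[action]
                                 + reward / self.reward_var) / precision
            self.var[action] = 1.0 / precision

Because the posterior variance of an action shrinks with every observed reward, the sampled values concentrate around the posterior means, and exploration of that action decreases automatically as its estimate becomes more certain.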