9 Towards Reinforcement Learning with LCS
Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks. There has been little theoretical LCS work that concentrates on these tasks (for example, [30, 224]), despite some obvious problems that need to be solved [11, 12, 77]. At the same time, other machine learning methods have steadily improved their performance on these tasks [126, 28, 204], based on extensive theoretical advances. To catch up with these methods, LCS need to refine their theory in order to offer competitive performance. This chapter provides a strong basis for further theoretical development within the MDP framework, and discusses some currently relevant issues.
Sequential decision tasks are, in general, characterised by a set of states and actions, where an action performed in a particular state causes a transition to the same or another state. Each transition is mediated by a scalar reward, and the aim is to perform actions in particular states such that the sum of rewards received is maximised in the long run. How to choose an action for a given state is determined by the policy. Even though the space of possible policies could be searched directly, a more common and more efficient approach is to learn, for each state, the sum of future rewards that one can expect to receive from that state, and to derive the optimal policy from that knowledge.
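As a brief reminder, in standard MDP notation (the discount factor $\gamma$ and the symbols below are the conventional ones and are not introduced in this section), the value of a state under a policy $\pi$ is the expected discounted sum of future rewards, and the optimal policy acts greedily with respect to the optimal values:
\[
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s, \pi \right],
\qquad
\pi^{*}(s) = \arg\max_{a} \mathbb{E}\left[ r_{t+1} + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s, a_t = a \right].
\]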
The core of Dynamic Programming (DP) is how to learn the mapping between states and their associated expected sums of rewards, but doing so requires a model of the transition probabilities and the rewards that are given. Reinforcement Learning (RL), on the other hand, aims at learning this mapping, known as the value function, while performing the actions, and thereby improves the policy at the same time. It can do so either without any model of the transitions and rewards - known as model-free RL - or by modelling the transitions and rewards from observations and then using DP methods based on these models to improve the policy - known as model-based RL. Here, we mainly focus on model-free RL, as it is the variant that has been used most frequently in LCS.
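To make the idea of learning the value function while acting concrete, the sketch below shows tabular Q-learning, a standard model-free method closely related to the update used in XCS-style credit assignment. It is a generic illustration rather than the LCS formulation developed in this book; the environment interface (reset, step, actions) and all parameter values are assumptions chosen for the example.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learns action values from interaction alone,
    without a model of transitions or rewards (model-free RL).
    Assumes a hypothetical environment with reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: explore occasionally
            actions = env.actions(state)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # bootstrap towards reward plus discounted value of the
            # best action in the next state (zero at terminal states)
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (
                reward + gamma * best_next - Q[(state, action)])

            state = next_state

    # the policy is derived greedily from the learned action values
    policy = lambda s: max(env.actions(s), key=lambda a: Q[(s, a)])
    return Q, policy

Note that the update uses only observed transitions and rewards; no transition or reward model is ever built, which is exactly what distinguishes model-free from model-based RL in the discussion above.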