9 Towards Reinforcement Learning with LCS
Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks. There has been little theoretical LCS work that concentrates on these tasks (for example, [30, 224]), despite some obvious problems that need to be solved [11, 12, 77]. At the same time, other machine learning methods have steadily improved their performance on these tasks [126, 28, 204], based on extensive theoretical advances. To catch up with these methods, LCS need to refine their theory in order to offer competitive performance. This chapter provides a strong basis for further theoretical development within the MDP framework, and discusses some currently relevant issues.
Sequential decision tasks are, in general, characterised by a set of states and actions, where an action performed in a particular state causes a transition to the same or another state. Each transition is mediated by a scalar reward, and the aim is to perform actions in particular states such that the sum of rewards received is maximised in the long run. How to choose an action for a given state is determined by the policy. Even though the space of possible policies could be searched directly, a more common and more efficient approach is to learn, for each state, the sum of future rewards that one can expect to receive from that state, and to derive the optimal policy from that knowledge.
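As a brief reminder, in standard MDP notation (the discount factor $\gamma$ and the symbols below are the conventional ones and are not introduced in this section), the value of a state under a policy $\pi$ is the expected discounted sum of future rewards, and the optimal policy acts greedily with respect to the optimal values:
\[
V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s, \pi \right],
\qquad
\pi^{*}(s) = \arg\max_{a} \mathbb{E}\left[ r_{t+1} + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s, a_t = a \right].
\]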
The core of Dynamic Programming (DP) is how to learn the mapping between states and their associated expected sums of rewards, but doing so requires a model of the transition probabilities and the rewards that are given. Reinforcement Learning (RL), on the other hand, aims at learning this mapping, known as the value function, while performing the actions, and thereby improves the policy at the same time. It can do so either without any model of the transitions and rewards - known as model-free RL - or by modelling the transitions and rewards from observations and then using DP methods based on these models to improve the policy - known as model-based RL. Here, we mainly focus on model-free RL, as it is the variant that has been used most frequently in LCS.
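To make the idea of learning the value function while acting concrete, the sketch below shows tabular Q-learning, a standard model-free method closely related to the update used in XCS-style credit assignment. It is a generic illustration rather than the LCS formulation developed in this book; the environment interface (reset, step, actions) and all parameter values are assumptions chosen for the example.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learns action values from interaction alone,
    without a model of transitions or rewards (model-free RL).
    Assumes a hypothetical environment with reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: explore occasionally
            actions = env.actions(state)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # bootstrap towards reward plus discounted value of the
            # best action in the next state (zero at terminal states)
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (
                reward + gamma * best_next - Q[(state, action)])

            state = next_state

    # the policy is derived greedily from the learned action values
    policy = lambda s: max(env.actions(s), key=lambda a: Q[(s, a)])
    return Q, policy

Note that the update uses only observed transitions and rewards; no transition or reward model is ever built, which is exactly what distinguishes model-free from model-based RL in the discussion above.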