state transition occurs when the agent performs an action from the action set A. Each of these state transitions is mediated by a scalar reward. The aim of the agent is to find a policy, which is a mapping X → A that determines the action in each state, that maximises the reward in the long run.
While it is possible to search the space of possible policies directly, a more efficient approach is to compute the value function X × A → R that determines for each state which long-term reward to expect when performing a certain
action. If a model of the state transitions and rewards is known, Dynamic Pro-
gramming (DP) can be used to compute this function. Reinforcement Learning
(RL), on the other hand, deals with finding the value function if no such model
is available. As the latter is commonly the case, Reinforcement Learning is also
the approach employed by LCS.
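As an illustration of these two mappings, a value function over a finite state set X and action set A can be represented as a simple table, with the corresponding policy selecting in each state the action of highest value. The following minimal sketch uses hypothetical state and action labels and value entries; it is not part of the original text.

```python
# Minimal sketch: a tabular value function Q : X x A -> R and the greedy
# policy X -> A derived from it. States, actions and values are hypothetical.
X = ["s0", "s1"]                      # state set X
A = ["left", "right"]                 # action set A

# Value function as a table: expected long-term reward per (state, action).
Q = {
    ("s0", "left"): 0.2, ("s0", "right"): 1.0,
    ("s1", "left"): 0.7, ("s1", "right"): 0.4,
}

def greedy_policy(x):
    """Policy X -> A: pick the action with the highest value in state x."""
    return max(A, key=lambda a: Q[(x, a)])

for x in X:
    print(x, "->", greedy_policy(x))
```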
There are two approaches to RL: either one learns a model of the transitions
and rewards by observations and then uses dynamic programming to find the
value function, called model-based RL, or one estimates the value function directly
while interacting with the environment, called model-free RL.
In the model-based case, a model of the state transitions and rewards needs
to be derived from the given observations, both of which are regression tasks. If
the policy is to be computed while sampling the environment, the model needs
to be updated incrementally, which requires an incremental learner.
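As a sketch of the regression tasks involved in the model-based case, the following hypothetical snippet maintains running averages of observed rewards and transition frequencies, updating both incrementally with each observation. The interface and the observation values are illustrative assumptions, not taken from the original text.

```python
from collections import defaultdict

# Minimal sketch of the model-based regression tasks: incrementally estimate
# expected reward and transition probabilities from a stream of
# (state, action, reward, next_state) observations.
reward_sum = defaultdict(float)    # accumulated reward per (x, a)
next_count = defaultdict(float)    # counts per (x, a, x')
visit_count = defaultdict(int)     # visits per (x, a)

def update_model(x, a, r, x_next):
    """Incremental update of the reward and transition model."""
    visit_count[(x, a)] += 1
    reward_sum[(x, a)] += r
    next_count[(x, a, x_next)] += 1

def expected_reward(x, a):
    n = visit_count[(x, a)]
    return reward_sum[(x, a)] / n if n else 0.0

def transition_prob(x, a, x_next):
    n = visit_count[(x, a)]
    return next_count[(x, a, x_next)] / n if n else 0.0

# Hypothetical observation stream:
for obs in [("s0", "go", 1.0, "s1"), ("s0", "go", 0.0, "s0")]:
    update_model(*obs)
print(expected_reward("s0", "go"), transition_prob("s0", "go", "s1"))
```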
In the model-free case, the function to model is the estimate of the value func-
tion, again leading to a regression task that needs to be handled incrementally.
Additionally, the value function estimate is also updated incrementally, and as
it is the data-generating process, this process is slowly changing. As a result,
there is a dynamic interaction between the RL algorithm that updates the value
function estimate and the incremental regression learner that models it, which
is not in all cases stable and needs special consideration [25]. These are additional difficulties that need to be taken into account when performing model-free RL.
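To make this interaction concrete, the following hypothetical sketch combines a temporal-difference (Q-learning) update with an incremental linear regression learner that approximates the value function by a gradient step towards each new target. The feature function, step size, discount factor and sample transition are assumptions chosen for illustration.

```python
import numpy as np

# Minimal sketch of model-free RL with an incremental regression learner:
# a linear approximation Q(x, a) = w . phi(x, a), updated towards the
# Q-learning target after each observed transition.
ACTIONS = [0, 1]
alpha, gamma = 0.1, 0.9               # step size and discount factor (assumed)
w = np.zeros(4)                       # weights of the linear model

def phi(x, a):
    """Hypothetical feature vector for a (state, action) pair."""
    v = np.zeros(4)
    v[2 * a] = 1.0
    v[2 * a + 1] = x
    return v

def q(x, a):
    return w @ phi(x, a)

def td_update(x, a, r, x_next):
    """Q-learning target fed to the incremental (gradient) regression step."""
    global w
    target = r + gamma * max(q(x_next, b) for b in ACTIONS)
    w += alpha * (target - q(x, a)) * phi(x, a)   # regress towards the target

# One hypothetical transition (state 0.5, action 1, reward 1.0, next state 0.2):
td_update(0.5, 1, 1.0, 0.2)
print(w)
```

Note that the regression target itself depends on the current value function estimate, which is exactly the dynamic interaction mentioned above.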
Clearly, although the sequential decision task was the prime motivator for
LCS, it is also the most complex to tackle. Therefore, we deal with standard
regression and classification tasks first, and come back to sequential decision
tasks in Chap. 9. Even then, they will only be dealt with from the theoretical perspective of stability, as they require an incremental learning procedure that will not be developed here.
3.1.5 Batch vs. Incremental Learning
In batch learning it is assumed that the whole training set is available at once,
and that the order of the observations in that set is irrelevant. Thus, the model
can be trained with all data at once and in any order.
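The difference can be illustrated with the simplest possible model, estimating the mean of the observed outputs from hypothetical data; the batch estimate uses the whole training set at once, while the incremental variant, described in the following paragraph, processes one observation at a time.

```python
# Minimal sketch contrasting batch and incremental estimation of a mean
# (hypothetical data); the incremental case is discussed next.
data = [2.0, 4.0, 6.0, 8.0]

# Batch learning: all observations are available at once, order irrelevant.
batch_mean = sum(data) / len(data)

# Incremental learning: update the estimate with one observation at a time.
mean, n = 0.0, 0
for y in data:                         # observations arriving as a stream
    n += 1
    mean += (y - mean) / n             # running-mean update
print(batch_mean, mean)                # identical here: 5.0 5.0
```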
Incremental learning methods differ from batch learning in that the model
is updated with each additional observation separately, and as such can handle
observations that arrive sequentially as a stream. Revisiting the assumption of
 