the optimal policy is the same as before. Fig. 9.3(b) shows the optimal value
function for the modified problem definition.
Observe that, in contrast to Fig. 9.3(a), all values of V are negative or zero,
and their absolute magnitude grows with the distance from the terminal state.
The difference in magnitude between two successive states, on the other hand,
still decreases with the distance from the terminal state. This clearly violates the
assumption that this difference is proportional to the absolute magnitude of the
values, as the modified problem definition causes exactly the opposite pattern.
Hence, the relative error approach will certainly fail, as it was not designed to
handle such cases.
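To make the pattern concrete, suppose (as an illustrative assumption, since the modified reward scheme is not restated here) that every transition yields a reward of -1, the terminal state has value zero, and rewards are discounted by a factor γ. For a state at distance d from the terminal state, the optimal value and the difference between successive states are then

\[
V^*(d) = -\sum_{k=0}^{d-1} \gamma^k = -\frac{1-\gamma^d}{1-\gamma},
\qquad
\big|V^*(d) - V^*(d+1)\big| = \gamma^d ,
\]

so |V^*(d)| grows with d while the successive difference shrinks as γ^d, which is exactly the inversion of the proportionality that the relative error measure assumes.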
To create a task where the relative error measure fails, the problem had to be
redefined such that the value function takes exclusively negative values. While
it might be possible to do the opposite and redefine each problem such that it
conforms to the assumption that the relative error measure is based on, an alter-
native that does not require modification of the problem definition is preferable.
A Possible Alternative?
It was shown in Sect. 8.3.4 that the optimality criterion that was introduced in
Chap. 7 is able to handle problems where the noise differs in different areas of
the input space. Given that it is possible to use this criterion in an incremen-
tal implementation, will such an implementation be able to perform long path
learning?
As previously discussed (see Sects. 5.1.2 and 7.2.2), a linear classifier model attributes all observed deviation from its linear model to measurement noise (implicitly including the stochasticity of the data-generating process). In reinforcement learning, an additional component of stochasticity is introduced by the ongoing updates of the value function estimates, which make these estimates non-stationary. Thus, in order for the LCS model to provide a good representation of the value function estimate, it needs to be able to handle both the measurement noise and the update noise, a differentiation that is absent from Barry's work [11, 12].
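As a rough illustration of this update noise, the following sketch (a minimal, hypothetical setup, not the experiments discussed here) runs tabular TD(0) on a fully deterministic corridor, so that the environment contributes no measurement noise at all; any fluctuation in the regression targets then stems entirely from the bootstrapped value estimates being updated.

    # Deterministic corridor: states 0..N-1, state 0 is terminal.
    # Reward -1 per transition, discount GAMMA; the environment itself is
    # noise-free, so all fluctuation in the targets is update noise.
    N, GAMMA, ALPHA = 15, 0.99, 0.1   # illustrative values, not from the text
    V = [0.0] * N                     # tabular value function estimate

    target_trace = []                 # TD targets observed for state N-1
    for episode in range(200):
        s = N - 1                     # always start at the far end
        while s > 0:
            s_next = s - 1                        # deterministic step to terminal
            target = -1.0 + GAMMA * V[s_next]     # bootstrapped TD(0) target
            if s == N - 1:
                target_trace.append(target)
            V[s] += ALPHA * (target - V[s])       # TD(0) update
            s = s_next

    print("first targets:", [round(t, 2) for t in target_trace[:3]])
    print("last targets: ", [round(t, 2) for t in target_trace[-3:]])

Running this shows the targets for the far state drifting from -1 towards roughly -(1-γ^(N-1))/(1-γ) as the downstream estimates converge, so a model fitting these targets faces a non-stationary signal even in a noise-free environment.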
Let us assume that the optimality criterion causes the size of the area of the
input space that is matched by a classifier to be proportional to the level of
noise in the data, such that the model is refined in areas where the observations
are known to accurately represent the data-generating process. Considering only
measurement noise, when applied to value function approximation this would
lead to having more specific classifiers in states where the difference in magnitude
of the value function for successive states is low, as in such areas this noise is
deemed to be low. Therefore, the optimality criterion should provide an adequate
value function approximation of the optimal value function, even in cases where
long action sequences need to be represented.
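The expected consequence can be sketched numerically. If the size of a classifier's matched interval is taken to be proportional to the local noise level, and the successive value differences γ^d from the worked example above serve as a crude proxy for that level, then the classifiers far from the terminal state, where the differences are small, come out most specific. Both the proportionality rule and the noise proxy are assumptions made purely for illustration; they are not the optimality criterion itself.

    # Illustrative sketch: classifier matching width proportional to a local
    # noise proxy; the proxy and the proportionality rule are assumptions.
    GAMMA, N, SCALE = 0.9, 30, 10     # hypothetical constants

    def noise_proxy(d):
        # Successive value difference |V*(d) - V*(d+1)| = GAMMA**d serves as
        # a crude stand-in for the locally estimated noise level.
        return GAMMA ** d

    # Cover the corridor with intervals whose width tracks the proxy.
    d, classifiers = 0, []
    while d < N:
        width = max(1, round(SCALE * noise_proxy(d)))
        classifiers.append((d, min(d + width, N) - 1))
        d += width

    for lo, hi in classifiers:
        print(f"classifier matches distances {lo:2d}..{hi:2d}")

The output shows coarse classifiers near the terminal state, where the successive differences (and hence the deemed noise) are large, and increasingly specific ones further away, which is the behaviour the argument above predicts for long paths.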
Turning to the update noise, its magnitude is related to the magnitude of the
optimal value function, as demonstrated in Fig. 9.4: the noise appears to be
largest where the optimal values themselves are large. Due
to this noise, the model in such areas will most likely be coarse. With respect
to the corridor finite state world, for which the optimal value function is shown