[Figure: two panels, (a) "Optimal V for Single Reward 1 at Terminal State" and (b) "Optimal V for Reward -1 for Each Action", each plotting the optimal value function against the number of steps from the terminal state (0 to 14).]
Fig. 9.3. Plots showing the optimal value function for the 15-step corridor finite state world for γ = 0.9. The value function in (a) results from describing the task by giving a single reward 1 upon reaching the terminal state, and a reward of 0 for all other transitions. In (b) the values are based on a task description that gives a reward of −1 for all transitions. Note that in both cases the optimal policy is the same, but in (a) all values are positive, and in (b) they are negative.
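To make the two task descriptions concrete, the following sketch computes both optimal value functions in closed form. It assumes a deterministic corridor in which the optimal policy always moves one step closer to the terminal state; the function names and the state indexing (1 to 15 steps away from the terminal state) are illustrative choices, not taken from the text.

```python
# Sketch: optimal values for the 15-step corridor under the two reward
# schemes of Fig. 9.3, assuming a deterministic corridor where the optimal
# policy always moves one step closer to the terminal state.
GAMMA = 0.9
N_STEPS = 15  # states considered: 1..15 steps away from the terminal state

def v_single_reward(n, gamma=GAMMA):
    """Reward 1 on the transition into the terminal state, 0 otherwise:
    V*(n) = gamma^(n-1), so values decay geometrically with distance."""
    return gamma ** (n - 1)

def v_negative_reward(n, gamma=GAMMA):
    """Reward -1 for every transition:
    V*(n) = -(1 - gamma^n) / (1 - gamma), a negative value that grows in
    magnitude with the distance from the terminal state."""
    return -(1 - gamma ** n) / (1 - gamma)

for n in range(1, N_STEPS + 1):
    print(f"{n:2d} steps away:  V_a = {v_single_reward(n):6.3f}   "
          f"V_b = {v_negative_reward(n):7.3f}")
```

Under these assumptions the values match the qualitative picture in Fig. 9.3: in (a) the values shrink geometrically towards 0 with increasing distance, while in (b) they decrease from −1 towards roughly −8 and remain negative throughout.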
expected to suffer from the same problem, even though that remains to be shown
empirically.
Using the Relative Error
Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making a classifier's error calculation relative to its prediction of the value function [12]. The first approach is to estimate the distance of the matched states to the terminal state and to scale the error accordingly, but it suffers from the inaccuracy of predicting this distance.
A second, more promising alternative proposed in his study is to scale the
measured prediction error by the inverse absolute magnitude of the prediction.
The underlying assumption is that the difference in optimal values between two
successive states is proportional to the absolute magnitude of these values, as
can be seen in Fig. 9.3(a). Consequently, the relative error is larger for states
that are further away from the terminal state, and overly general classifiers are
identified as such. This modification allows XCS to find the optimal policy in the
15-step corridor finite state world, which it fails to do without the modification.
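The sketch below illustrates the idea of a prediction-scaled error rather than Barry's exact formulation; the classifier representation, the learning rate beta, and the function name relative_error_update are assumptions made for the example.

```python
# Sketch of a relative (prediction-scaled) error update for a classifier.
# The measured absolute error is divided by the absolute magnitude of the
# classifier's prediction before the usual Widrow-Hoff style error update.
EPSILON_FLOOR = 1e-6  # guard against division by (near-)zero predictions

def relative_error_update(classifier, target, beta=0.2):
    """Update the classifier's error estimate using the error relative to
    the absolute magnitude of its prediction."""
    absolute_error = abs(target - classifier["prediction"])
    relative_error = absolute_error / max(abs(classifier["prediction"]),
                                          EPSILON_FLOOR)
    classifier["error"] += beta * (relative_error - classifier["error"])
    return classifier["error"]

# Example: far from the terminal state the prediction is small, so the same
# absolute error yields a much larger relative error, which helps to expose
# overly general classifiers.
cl = {"prediction": 0.25, "error": 0.0}
print(relative_error_update(cl, target=0.28))
# relative error = 0.03 / 0.25 = 0.12; updated estimate = 0.2 * 0.12 = 0.024
```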
Where it Fails
The problem of finding the shortest path to the terminal state can also be defined
differently: rather than giving a single reward of 1 upon reaching the terminal
state, one can alternatively punish each transition with a reward of −1. As the
reward is to be maximised, the number of transitions is minimised, and therefore