[Fig. 9.4, panels (a) and (b): plots of update noise variance against state (0-30); panel titles: "Update Noise for Single Reward 1 at Terminal State" and "Update Noise for Reward -1 for Each Action"; x-axis label: State]
Fig. 9.4. Update noise variance for value iteration performed on the 15-step corridor finite state world. Plot (a) shows the variance when a reward of 1 is given upon reaching the terminal state, and 0 for all other transitions. Plot (b) shows the same when rewarding each transition with -1. The states are enumerated in the order $x_{1a}, x_{1b}, x_{2a}, \ldots, x_{15b}, x_{16}$. The noise variance is determined by initialising the value vector to 0 for each state and storing the value vector after each iteration of value iteration, until convergence. The noise variance is then the variance of the values of each state over all iterations. The plots clearly show that this variance is higher for states that have a larger absolute optimal value. The optimal values are shown in Fig. 9.3.
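The variance computation described in the caption is straightforward to reproduce. The following sketch assumes a simple deterministic 31-state chain as a stand-in for the corridor world (its exact transition structure is not reproduced here) and an illustrative discount factor of 0.9; it runs synchronous value iteration from a zero-initialised value vector, records the value vector after every sweep until convergence, and returns the per-state variance over all recorded iterations.

    import numpy as np

    # Hypothetical stand-in for the corridor world of Fig. 9.4: a deterministic
    # chain of 31 states (x_1a, x_1b, ..., x_15b, x_16) in which the single
    # available action moves one state closer to the terminal state.
    N_STATES = 31          # states 0..30, state 30 is terminal and absorbing
    GAMMA = 0.9            # assumed discount factor (not given in the excerpt)

    def update_noise_variance(reward_fn, tol=1e-6, max_iter=1000):
        """Run synchronous value iteration from V = 0, store the value vector
        after every sweep until convergence, and return the per-state variance
        of the stored values, as described in the caption of Fig. 9.4."""
        V = np.zeros(N_STATES)
        history = []
        for _ in range(max_iter):
            V_new = V.copy()
            for s in range(N_STATES - 1):       # terminal state keeps value 0
                s_next = s + 1                  # single deterministic action
                V_new[s] = reward_fn(s, s_next) + GAMMA * V[s_next]
            history.append(V_new.copy())
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return np.var(np.array(history), axis=0)   # variance over iterations

    # (a) reward 1 only on the transition into the terminal state
    var_a = update_noise_variance(
        lambda s, s_next: 1.0 if s_next == N_STATES - 1 else 0.0)
    # (b) reward -1 for every action taken
    var_b = update_noise_variance(lambda s, s_next: -1.0)

Plotting var_a and var_b against the state index reproduces the qualitative pattern of Fig. 9.4: states with larger absolute optimal values show a larger variance of their value estimates over the iterations.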
in Fig. 9.3(b), this would have the effect of providing an overly coarse model for
states that are distant from the terminal state, and thus might cause the policy
to be sub-optimal, just as in XCS. However, this depends heavily on the dynamic
interaction between the RL method and the incremental LCS implementation.
Thus, definite statements need to be postponed until such an implementation
is available.
Overall, the introduced optimality criterion seems to be a promising approach to handling long path learning in LCS, as long as only measurement noise is considered. Given the additional update noise, however, the criterion might suffer from the same problems as the approach based on the relative error. The significance of this influence cannot be evaluated before an incremental implementation is available. Alternatively, one might seek RL approaches that allow measurement noise and update noise to be differentiated, which would make it possible for the model itself to concentrate only on the measurement noise. Whether such an approach is feasible still needs to be investigated.
9.5.2 Exploration and Exploitation
Maintaining the balance between exploiting current knowledge to guide action
selection and exploring the state space to gain new knowledge is an essential
problem for reinforcement learning. Too much exploration implies the frequent
selection of sub-optimal actions and causes the accumulated reward to decrease.
Too much emphasis on exploitation of current knowledge, on the other hand,