selection, without considering the action-value estimates. In exploitation trials,
on the other hand, all actions are chosen greedily. This causes significant exploration
of the search space, which facilitates learning of the optimal policy. A drawback of
this approach is that, on average, the received reward is lower than if a more
reward-oriented policy, like ε-greedy or softmax action selection, is used. In any
case, undirected policies should only be used if directed exploration is too costly
to implement.
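For reference, the following is a minimal sketch of the two reward-oriented policies named above, ε-greedy and softmax action selection, operating on an array of action-value estimates; the function names and parameter values are illustrative only.

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(q_values, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise act greedily."""
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))

    def softmax_action(q_values, temperature=1.0):
        """Sample an action with probability proportional to exp(Q(a) / temperature)."""
        prefs = np.asarray(q_values, dtype=float) / temperature
        prefs -= prefs.max()                        # subtract max for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(rng.choice(len(q_values), p=probs))

Lowering ε or the temperature shifts both policies from exploration towards exploitation, which is why they typically collect more reward than a purely random exploration policy.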
Directed Exploration
Directed exploration is significantly more efficient than undirected exploration
because it takes into account the uncertainty of the action-value function estimate.
This allows it to perform sub-optimal actions in order to reduce this uncertainty,
until it is certain that no further exploration is required to find the optimal policy.
The result is less, but more intelligent, exploration.
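One common way of directing exploration by the uncertainty of the estimates, shown here purely as an illustration and not as any of the specific methods cited below, is to add a confidence-interval bonus to each action value in the style of UCB; the sketch assumes that per-action visit counts are maintained alongside the estimates.

    import numpy as np

    def ucb_action(q_values, action_counts, total_count, c=2.0):
        """Pick the action maximising Q(a) plus an uncertainty bonus.

        Rarely tried actions receive a large bonus, so they are explored
        until their value estimate is sufficiently certain.
        """
        q_values = np.asarray(q_values, dtype=float)
        counts = np.asarray(action_counts, dtype=float)
        # Untried actions get an infinite bonus and are therefore chosen first.
        bonus = np.where(counts > 0,
                         c * np.sqrt(np.log(max(total_count, 1)) / np.maximum(counts, 1)),
                         np.inf)
        return int(np.argmax(q_values + bonus))
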
This strategy is implemented by several methods, both model-based and
model-free (for example, [126, 28, 204]). In fact, some of them have been shown
to be very efficient in the Probably Approximately Correct (PAC) sense (for example,
[204]). These, however, require a model of the reward and transition function,
and thus they have a larger space complexity than model-free RL methods [205].
Nonetheless, methods that perform intelligent exploration currently outperform
all other RL methods [149]. Recently, a model-free method with intelligent
exploration has also become available [205], but according to Littman [149], it
performs "really slow" when compared to model-based alternatives. None of these
methods will be discussed in detail, but they all share the same concept of performing
actions such that the certainty of the action-value function estimate is increased.
A recent LCS approach aimed at providing such intelligent exploration [166],
but without considering the RL techniques that are available. These techniques
could be used in LCS even without having models of the transition and reward
function, by proceeding in a manner similar to [69] and building a model at the
same time as using it to learn the optimal policy. Anticipatory LCS [202, 48, 40,
91, 90] are already used to model at least the transition function, and can easily
be modified to include the reward function. An LCS that does both has already
been developed to perform model-based RL [92, 89], but as it uses heuristics
rather than evolutionary computation techniques for model structure search,
some LCS workers did not consider it to be an LCS. In any case, having
such a model opens the path towards using new exploration control methods to
improve their efficiency and performance.
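To make the idea of building a model and simultaneously using it to learn the optimal policy concrete, the following is a minimal tabular sketch in the spirit of Dyna-style planning; the environment interface (reset, step, actions) is a hypothetical stand-in, and this is not the LCS method cited above.

    import random
    from collections import defaultdict

    def dyna_q(env, episodes=100, alpha=0.1, gamma=0.95, epsilon=0.1, planning_steps=10):
        """Tabular Dyna-style sketch: update Q from real transitions while also
        building a model of the transition and reward function and replaying it.

        `env` is assumed to expose reset() -> state,
        step(action) -> (next_state, reward, done), and a list env.actions.
        """
        Q = defaultdict(float)      # Q[(state, action)] action-value estimates
        model = {}                  # model[(state, action)] = (next_state, reward, done)

        def greedy_value(s):
            return max(Q[(s, a)] for a in env.actions)

        def act(s):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = act(s)
                s2, r, done = env.step(a)
                # Direct RL update from the real transition.
                target = r if done else r + gamma * greedy_value(s2)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                # Record the transition and reward in the learned model.
                model[(s, a)] = (s2, r, done)
                # Planning: additional updates from transitions replayed out of the model.
                for _ in range(planning_steps):
                    (ps, pa), (ps2, pr, pdone) = random.choice(list(model.items()))
                    ptarget = pr if pdone else pr + gamma * greedy_value(ps2)
                    Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
                s = s2
        return Q
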
Bayesian Reinforcement Learning
The intelligent exploration methods discussed above either consider the certainty
of the estimate only implicitly, or maintain it by some form of confidence interval.
A more elegant approach is to employ Bayesian statistics and maintain
complete distributions over each of the estimates.
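As a simple illustration of this idea, and not one of the methods cited in this section, the sketch below maintains an independent Normal posterior over each action's value, assuming Gaussian rewards with known variance, and selects actions by Thompson sampling, that is, by acting greedily on a single draw from those posteriors.

    import numpy as np

    class BayesianActionValues:
        """Independent Normal posterior over each action's value, assuming
        Gaussian rewards with known variance; actions chosen by Thompson sampling."""

        def __init__(self, n_actions, prior_mean=0.0, prior_var=1.0, reward_var=1.0):
            self.mean = np.full(n_actions, prior_mean, dtype=float)
            self.var = np.full(n_actions, prior_var, dtype=float)
            self.reward_var = reward_var
            self.rng = np.random.default_rng()

        def select_action(self):
            # Sample one value per action from its posterior; act greedily on the sample.
            samples = self.rng.normal(self.mean, np.sqrt(self.var))
            return int(np.argmax(samples))

        def update(self, action, reward):
            # Standard Normal-Normal conjugate update of posterior mean and variance.
            precision = 1.0 / self.var[action] + 1.0 / self.reward_var
            self.mean[action] = (self.mean[action] / self.var[action]
                                 + reward / self.reward_var) / precision
            self.var[action] = 1.0 / precision

Because the posterior variance of an action shrinks with every observed reward, the sampled values concentrate around the posterior means, and exploration of that action decreases automatically as its estimate becomes more certain.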