might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important, as it has a significant impact on the performance of RL methods.
There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge, or one can take this certainty into account to choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable. All of these variants and their applicability to RL with LCS are discussed below.
Undirected Exploration
A certain degree of exploration can be achieved by selecting a sub-optimal action every now and then. This form of exploration is called undirected, as it does not take into account the certainty about the current value or action-value function estimate. Probably the most popular instances of this exploration type are the ε-greedy policy and softmax action selection.
The greedy policy is the one that chooses the action that is optimal according to the current knowledge at each step, and is thus given by μ(x) = argmax_a Q(x, a). In contrast, the ε-greedy policy selects a random sub-optimal action with probability ε, and the greedy action otherwise. Its stochastic policy is given by
\[
\mu(a \mid x) =
\begin{cases}
1 - \varepsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(x, a'), \\
\varepsilon / (|\mathcal{A}| - 1) & \text{otherwise},
\end{cases}
\qquad (9.51)
\]
where μ(a | x) denotes the probability of choosing action a in state x.
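As a concrete illustration of (9.51), consider the following minimal Python sketch. It assumes NumPy, a vector q holding the current estimates Q(x, a) for all actions in the current state, and an illustrative function name; it is a sketch of the sampling rule only, not of any particular LCS implementation.

    import numpy as np

    def epsilon_greedy_action(q, epsilon, rng):
        """Sample an action from the epsilon-greedy policy of (9.51).

        q       : 1-D array of action-value estimates Q(x, a) for the current state
        epsilon : probability mass spread uniformly over the sub-optimal actions
        rng     : NumPy random Generator
        """
        greedy = int(np.argmax(q))
        if len(q) > 1 and rng.random() < epsilon:
            # pick one of the |A| - 1 non-greedy actions,
            # each with probability epsilon / (|A| - 1)
            others = [a for a in range(len(q)) if a != greedy]
            return int(rng.choice(others))
        return greedy  # greedy action, chosen with probability 1 - epsilon

    rng = np.random.default_rng(0)
    print(epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))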
ε-greedy does not consider the magnitude of the action-value function when choosing the action and thus does not differentiate between actions that are only slightly sub-optimal and ones that are significantly sub-optimal. This is accounted for by softmax action selection, where actions are chosen in proportion to the magnitude of the estimate of their associated expected return. One possible implementation is to sample actions from the Boltzmann distribution, given by
\[
\mu(a \mid x) = \frac{\exp\big(Q(x, a)/T\big)}{\sum_{a' \in \mathcal{A}} \exp\big(Q(x, a')/T\big)},
\qquad (9.52)
\]
where T is the temperature that allows regulating the strength with which the magnitude of the expected return is taken into account. A low temperature T → 0 causes greedy action selection. Raising the temperature T → ∞, on the other hand, makes the stochastic policy choose all actions with equal probability.
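The effect of the temperature can be made concrete with a small sketch, again assuming NumPy and a vector q of action-value estimates for the current state (names are illustrative). Subtracting the maximum before exponentiating leaves the distribution of (9.52) unchanged but avoids numerical overflow at low temperatures.

    import numpy as np

    def boltzmann_policy(q, temperature):
        """Action distribution of (9.52) for one state."""
        z = (q - np.max(q)) / temperature   # shift for numerical stability
        p = np.exp(z)
        return p / p.sum()

    q = np.array([0.1, 0.5, 0.2])
    for T in (0.05, 1.0, 100.0):
        print(T, np.round(boltzmann_policy(q, T), 3))
    # T = 0.05 is close to greedy selection, T = 100 is close to uniform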
XCS(F) also uses undirected exploration, but with neither of the above policies. Instead, it alternates between exploration and exploitation trials, where a single trial is a sequence of transitions until a goal state is reached or the maximum number of steps is exceeded. Exploration trials feature uniform random action selection.
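The trial-alternation scheme itself can be sketched as follows. This is only an illustration of the alternation between uniform random and greedy action selection, not of XCS(F) itself: the environment interface, the q_values function, and all names are hypothetical, and the rule-discovery and parameter-update machinery of XCS(F) is omitted.

    import numpy as np

    def run_trial(env, q_values, explore, max_steps, rng):
        """One trial: uniform random actions if exploring, greedy actions otherwise."""
        x = env.reset()
        for _ in range(max_steps):
            q = q_values(x)  # estimates Q(x, .) for the current state
            a = int(rng.integers(len(q))) if explore else int(np.argmax(q))
            x, _reward, done = env.step(a)
            if done:         # goal state reached
                break

    def alternate_trials(env, q_values, n_trials, max_steps, rng):
        # even-numbered trials explore, odd-numbered trials exploit
        for i in range(n_trials):
            run_trial(env, q_values, explore=(i % 2 == 0),
                      max_steps=max_steps, rng=rng)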