Here we use it specifically for methods based on averaging complete returns
(as opposed to methods that learn from partial returns, considered in the next
chapter). Fig. 10.4 illustrates the return obtained by Monte Carlo sampling for one step
of learning. Through iterative learning, the actually obtained returns are then used to
approximate the true value function.
Monte Carlo methods do not assume complete knowledge of the environment,
but learn from on-line experience. Monte Carlo methods are ways of solving the
reinforcement learning problem based on averaging sample returns. Given policy
π, compute V^π: for each state s_t visited under policy π, R(s_t) is the long-term
return obtained from s_t onward; R(s_t) is added to the list of returns recorded for
s_t, and the value estimate is the average of that list,

$$ V(s_t) = \operatorname{average}\bigl( R(s_t) \bigr) $$
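As a concrete illustration, the following Python sketch estimates V^π by keeping one list of observed returns per state and averaging it, in the spirit of the equation above. The episode format (a list of (state, reward) pairs collected while following π), the discount factor gamma, and the function name mc_value_estimate are assumptions made for this example, not details given in the text.

    from collections import defaultdict

    def mc_value_estimate(episodes, gamma=1.0):
        """First-visit Monte Carlo prediction: V(s) is the average of the returns R(s)."""
        returns = defaultdict(list)          # state -> list of observed returns
        for episode in episodes:             # episode: [(state, reward), ...] under policy pi
            # Return following each time step, computed backwards: G_t = r_t + gamma * G_{t+1}
            G, returns_after = 0.0, []
            for _, reward in reversed(episode):
                G = reward + gamma * G
                returns_after.append(G)
            returns_after.reverse()

            seen = set()
            for t, (state, _) in enumerate(episode):
                if state not in seen:        # record only the first visit to each state
                    seen.add(state)
                    returns[state].append(returns_after[t])

        # V(s_t) = average(R(s_t)), as in the equation above
        return {s: sum(r) / len(r) for s, r in returns.items()}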
The average over the list can be computed incrementally,

$$ V(s_t) \leftarrow V(s_t) + \frac{R(s_t) - V(s_t)}{N(s_t) + 1}, \qquad N(s_t) \leftarrow N(s_t) + 1 \qquad (10.11) $$

where N(s_t) counts how many returns have been recorded for s_t.
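With the incremental form of Eq. (10.11), only the current estimate V(s) and the visit count N(s) are kept, so the list of returns never has to be stored. A minimal sketch under the same assumptions as above (the dictionaries V and N start out empty, and G is the newly observed return R(s_t)):

    def mc_incremental_update(V, N, state, G):
        """Eq. (10.11): V(s) <- V(s) + (G - V(s)) / (N(s) + 1);  N(s) <- N(s) + 1."""
        v = V.get(state, 0.0)
        n = N.get(state, 0)
        V[state] = v + (G - v) / (n + 1)
        N[state] = n + 1

    # Replacing the list-append step of the previous sketch with this update
    # produces exactly the same running averages.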
Under Monte Carlo control, policy evaluation and policy improvement use the same
stochastic policy, for example the ε-greedy policy

$$ \pi(s, a) =
   \begin{cases}
     1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a = a^{*} \\[6pt]
     \dfrac{\varepsilon}{|A(s)|}, & a \neq a^{*}
   \end{cases}
   \qquad (10.12) $$

where $a^{*} = \arg\max_{a} Q(s, a)$.
In learning, once some actions have been found to be good, which action should the
agent select at the next decision point? One option is to make full use of existing
knowledge and select the currently best action. This has a drawback: better actions
may never be discovered. Conversely, if the agent always tries new actions, it makes
no progress in exploiting what it has learned. The agent thus faces a tradeoff between
exploration of unknown actions (to gather new information) and exploitation of actions
it has already learned to yield high reward (to maximize its cumulative reward). There
are two main methods: ε-greedy