Here we use it specifically for methods based on averaging complete returns
(as opposed to methods that learn from partial returns, considered in the next
chapter). Fig. 10.4 illustrates the return obtained by Monte Carlo sampling for one step
of learning. Through iterative learning, the actually obtained returns are then used to
approximate the true value function.
Monte Carlo methods do not assume complete knowledge of the environment,
but learn from on-line experience. Monte Carlo methods are ways of solving the
reinforcement learning problem based on averaging sample returns. Given policy
π, compute V^π: for each state s_t visited under policy π, R(s_t) is the long-term
return obtained from s_t onward; R(s_t) is added to the list of returns recorded for
s_t, and the value estimate is the average of that list,

$$ V(s_t) = \operatorname{average}\bigl( R(s_t) \bigr) $$
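As a concrete illustration, the following Python sketch estimates V^π by keeping one list of observed returns per state and averaging it, in the spirit of the equation above. The episode format (a list of (state, reward) pairs collected while following π), the discount factor gamma, and the function name mc_value_estimate are assumptions made for this example, not details given in the text.

    from collections import defaultdict

    def mc_value_estimate(episodes, gamma=1.0):
        """First-visit Monte Carlo prediction: V(s) is the average of the returns R(s)."""
        returns = defaultdict(list)          # state -> list of observed returns
        for episode in episodes:             # episode: [(state, reward), ...] under policy pi
            # Return following each time step, computed backwards: G_t = r_t + gamma * G_{t+1}
            G, returns_after = 0.0, []
            for _, reward in reversed(episode):
                G = reward + gamma * G
                returns_after.append(G)
            returns_after.reverse()

            seen = set()
            for t, (state, _) in enumerate(episode):
                if state not in seen:        # record only the first visit to each state
                    seen.add(state)
                    returns[state].append(returns_after[t])

        # V(s_t) = average(R(s_t)), as in the equation above
        return {s: sum(r) / len(r) for s, r in returns.items()}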
The average over the list can be computed incrementally,

$$ V(s_t) \leftarrow V(s_t) + \frac{R(s_t) - V(s_t)}{N(s_t) + 1}, \qquad N(s_t) \leftarrow N(s_t) + 1 \qquad (10.11) $$

where N(s_t) counts how many returns have been recorded for s_t.
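With the incremental form of Eq. (10.11), only the current estimate V(s) and the visit count N(s) are kept, so the list of returns never has to be stored. A minimal sketch under the same assumptions as above (the dictionaries V and N start out empty, and G is the newly observed return R(s_t)):

    def mc_incremental_update(V, N, state, G):
        """Eq. (10.11): V(s) <- V(s) + (G - V(s)) / (N(s) + 1);  N(s) <- N(s) + 1."""
        v = V.get(state, 0.0)
        n = N.get(state, 0)
        V[state] = v + (G - v) / (n + 1)
        N[state] = n + 1

    # Replacing the list-append step of the previous sketch with this update
    # produces exactly the same running averages.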
Under Monte Carlo control, policy evaluation and policy improvement use the same
stochastic policy, for example the ε-greedy policy

$$ \pi(s, a) =
   \begin{cases}
     1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|}, & a = a^{*} \\[6pt]
     \dfrac{\varepsilon}{|A(s)|}, & a \neq a^{*}
   \end{cases}
   \qquad (10.12) $$

where $a^{*} = \arg\max_{a} Q(s, a)$.
In learning, once some actions have been found to be good, which action should the
agent select at the next decision point? One option is to make full use of existing
knowledge and select the currently best action. This has a drawback: better actions
may never be discovered. Conversely, if the agent always tries new actions, it makes
no progress in exploiting what it has learned. The agent thus faces a tradeoff between
exploration of unknown actions (to gather new information) and exploitation of actions
it has already learned to yield high reward (to maximize its cumulative reward). There
are two main methods: ε-greedy