The Mountain Car Problem - Design of Experiments for Reinforcement Learning

Chapter 5

The Mountain Car Problem

The mountain car problem (Moore 1990) is commonly used as a benchmark reinforcement learning problem to evaluate learning algorithms. The problem places a car in a valley, where the goal is to get the car to drive out of the valley (Fig. 5.1). The car's engine is not powerful enough for it to drive out of the valley, and the car must instead build up momentum by successively driving up opposing sides of the valley. The state ($\mathbf{x} = [x, \dot{x}]$) of the car is defined by its position $x \in [-1.2, 0.5]$ and its velocity $\dot{x} \in [-1.5, 1.5]$, and the goal is located at $x = 0.5$. At the beginning of each episode, $x$ is sampled uniformly at random from $[-1.2, 0.5]$ and $\dot{x} = 0$. We define the current position and velocity of the car by $x$ and $\dot{x}$, respectively, and the position and velocity of the car at the next time step by $x'$ and $\dot{x}'$, respectively.

The car's dynamics follow:

$$x' = x + \Delta t \, \dot{x}$$

$$\dot{x}' = \dot{x} + \Delta t \left( -9.8\, m \cos(3x) + \frac{f}{m}\, a - \mu \dot{x} \right)$$

where $\Delta t = 0.01$ is the time step, $m = 0.02$ is the car's mass, $f = 0.2$ is the engine force, and $\mu = 0.5$ is a friction coefficient. The variable $a$ represents the action taken by the agent, where $a = -1$ for driving backwards, $a = 0$ for neutral, and $a = 1$ for driving forwards. In other words, at any discrete time step, the driver gets to choose from these three actions.
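The dynamics described above can be simulated in a few lines. The following is a minimal sketch, not the author's implementation: the function names `reset` and `step` are hypothetical, and clipping the position and velocity to their stated ranges is an assumption, since the text gives the ranges but not how the boundaries are enforced.

```python
import math
import random

# Constants as stated in the text.
DT = 0.01                  # time step
MASS = 0.02                # car mass m
FORCE = 0.2                # engine force f
MU = 0.5                   # friction coefficient
X_MIN, X_MAX = -1.2, 0.5   # position range; the goal is at X_MAX
V_MIN, V_MAX = -1.5, 1.5   # velocity range
ACTIONS = (-1, 0, 1)       # backwards, neutral, forwards


def reset(rng=random):
    """Start an episode: uniformly random position, zero velocity."""
    return rng.uniform(X_MIN, X_MAX), 0.0


def step(x, v, a):
    """Advance one time step; returns (next position, next velocity, done)."""
    if a not in ACTIONS:
        raise ValueError("action must be -1, 0, or 1")
    # Position is updated with the current velocity, as in the first equation.
    x_next = x + DT * v
    v_next = v + DT * (-9.8 * MASS * math.cos(3 * x) + (FORCE / MASS) * a - MU * v)
    # Assumption: values are clipped to the stated ranges.
    x_next = min(max(x_next, X_MIN), X_MAX)
    v_next = min(max(v_next, V_MIN), V_MAX)
    done = x_next >= X_MAX  # episode ends once the goal position is reached
    return x_next, v_next, done
```

An episode would then start with `x, v = reset()` and repeatedly call `step(x, v, a)` with an action chosen by the agent until `done` is true.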

5.1 Reinforcement Learning Implementation

A three-layered neural network was used to learn the mountain car problem using the temporal difference algorithm TD(λ) (Sect. 2.2.3). The input to the network was the state of the car defined by its position and velocity, and the output of the network attempted to approximate the value function V(s, a), or the utility of taking action a when in state s at time t. Consequently, the network had two input nodes and three output nodes. The number of hidden nodes was varied during experimentation and will be discussed later. The hidden layer of the network used a hyperbolic tangent
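To make the network shape concrete, here is a minimal sketch of a forward pass with two inputs, a hyperbolic tangent hidden layer, and three outputs (one value estimate per action). Several details are assumptions rather than the author's choices: the hidden layer size `N_HIDDEN`, the bias terms, the weight initialization range, the linear output units, and the greedy action selection in `greedy_action`. All function names are hypothetical, and the TD(λ) weight update is not shown.

```python
import math
import random

N_INPUTS = 2    # position and velocity (from the text)
N_OUTPUTS = 3   # one value estimate per action (from the text)
N_HIDDEN = 8    # hypothetical; the text says this number was varied


def init_layer(n_in, n_out, rng):
    """Small random weights; each row holds n_in weights plus one bias weight."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(state, w_hidden, w_out):
    """Return the three action-value estimates for a (position, velocity) state."""
    inputs = list(state) + [1.0]  # constant 1.0 feeds the bias weight (assumption)
    hidden = [math.tanh(sum(w * u for w, u in zip(row, inputs))) for row in w_hidden]
    hidden.append(1.0)            # bias input for the output layer (assumption)
    # Assumption: linear output units.
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def greedy_action(values, actions=(-1, 0, 1)):
    """Pick the action whose estimated value is largest (assumed selection rule)."""
    best = max(range(len(values)), key=values.__getitem__)
    return actions[best]


rng = random.Random(0)
w_hidden = init_layer(N_INPUTS, N_HIDDEN, rng)
w_out = init_layer(N_HIDDEN, N_OUTPUTS, rng)
values = forward((-0.5, 0.0), w_hidden, w_out)
action = greedy_action(values)
```

Mapping the three outputs to the actions -1, 0, and 1 in order is a convention chosen for this sketch.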
