B.2 Methodology
This study is based on coupling a reinforcement learning application, the mountain
car domain, with an experimental design, and this section describes each component
of this experimental system. We use this work as a proof-of-concept in applying
design of experiments to understand the performance of reinforcement learning.
Consequently, we restrict this work to a relatively simple domain and a fundamental
model-free learning algorithm; however, these methods can be used to understand
other learning algorithms in additional domains.
B.2.1 Mountain Car Domain
The mountain car domain (Moore 1990) places a car in a valley, where the goal is to
get the car to drive out of the valley. The car's engine is not powerful enough for it to
drive out of the valley, and the car must instead build up momentum by successively
driving up opposing sides of the valley. The state of the car is defined by its position
x ∈ [−1.2, 0.5] and its velocity ẋ ∈ [−1.5, 1.5], and the goal is located at x = 0.5. At
the beginning of each episode, x is uniformly randomly sampled from [−1.2, 0.5]
and ẋ = 0. The dynamics of the car follow:
x_{t+1} = x_t + \Delta t \, \dot{x}_t

\dot{x}_{t+1} = \dot{x}_t + \Delta t \left( \frac{f a - 9.8 \, m \cos(3 x_t) - \mu \dot{x}_t}{m} \right)

where Δt = 0.01 is the time step, m = 0.02 is the car's mass, f = 0.2 is the engine
force, and μ = 0.5 is a friction coefficient. The variable a represents the action taken
by the agent, where a = −1 for driving backwards, a = 0 for neutral, and a = 1
for driving forwards. At every time step that the car has not reached the goal, the
agent receives a reward (i.e., penalty) r equal to x. When the car reaches the goal,
the agent receives a reward of 1, and the episode ends.
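To make the simulation concrete, the dynamics above can be implemented in a few lines of Python. The following sketch is only illustrative: the class and method names (MountainCar, reset, step), the clipping of position and velocity to the stated bounds, and the exact grouping of the force terms are assumptions rather than the implementation used in this study.

```python
import numpy as np

# Illustrative sketch of the mountain car dynamics described above.
# Names, the clipping to the stated bounds, and the grouping of the
# force terms are assumptions, not the authors' code.

DT = 0.01   # time step (Delta t)
M = 0.02    # car mass
F = 0.2     # engine force
MU = 0.5    # friction coefficient

X_MIN, X_MAX = -1.2, 0.5   # position bounds; goal at x = 0.5
V_MIN, V_MAX = -1.5, 1.5   # velocity bounds


class MountainCar:
    def __init__(self, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.reset()

    def reset(self):
        # Position sampled uniformly over [-1.2, 0.5]; velocity starts at 0.
        self.x = self.rng.uniform(X_MIN, X_MAX)
        self.v = 0.0
        return np.array([self.x, self.v])

    def step(self, a):
        """Apply action a in {-1, 0, 1}; return (state, reward, done)."""
        # Acceleration from the reconstructed update: engine force,
        # gravity along the slope, and friction, divided by the mass.
        accel = (F * a - 9.8 * M * np.cos(3.0 * self.x) - MU * self.v) / M
        # Update position with the old velocity, then the velocity itself.
        self.x = float(np.clip(self.x + DT * self.v, X_MIN, X_MAX))
        self.v = float(np.clip(self.v + DT * accel, V_MIN, V_MAX))
        done = self.x >= X_MAX
        # Reward: +1 on reaching the goal, otherwise the (usually negative)
        # position x acts as a per-step penalty.
        reward = 1.0 if done else self.x
        return np.array([self.x, self.v]), reward, done
```

An episode then consists of calling reset() once and calling step(a) repeatedly until done is returned as True.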
B.2.2 Agent Representation
A three-layer neural network is used to represent the agent and to learn the value
function V(s_t, a_t), which represents the value of pursuing action a_t while in state
s_t. The network has 2 inputs, corresponding to the state s_t = [x_t, ẋ_t]^T, 21 hidden
nodes, and 3 output nodes that represent the values of the three actions. The hidden
and output layers use tanh and linear transfer functions, respectively. Each time a
network is created, new weights are initialized by uniform random sampling over
[−0.1, 0.1]. The learning rates α are initialized by layer using a heuristic similar to
that described in Gatti and Embrechts (2012). Input-hidden (α_hi) and hidden-output
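For concreteness, the network just described can be sketched as a small feedforward model. The layer sizes, transfer functions, and weight-initialization range follow the text above; the presence of bias terms, the NumPy forward pass, and the names (ValueNetwork, values) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Sketch of the three-layer value-function network described above:
# 2 inputs (x_t, xdot_t), 21 tanh hidden nodes, 3 linear outputs (one value
# per action), with weights drawn uniformly from [-0.1, 0.1]. Bias terms
# are an assumption; the text does not specify them.

class ValueNetwork:
    def __init__(self, n_in=2, n_hidden=21, n_out=3, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        self.W_hi = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in))   # input-to-hidden
        self.b_h = rng.uniform(-0.1, 0.1, size=n_hidden)
        self.W_oh = rng.uniform(-0.1, 0.1, size=(n_out, n_hidden))  # hidden-to-output
        self.b_o = rng.uniform(-0.1, 0.1, size=n_out)

    def values(self, state):
        """Return estimated values of the three actions for a state [x, xdot]."""
        h = np.tanh(self.W_hi @ np.asarray(state) + self.b_h)  # tanh hidden layer
        return self.W_oh @ h + self.b_o                        # linear output layer
```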