B.2 Methodology
This study is based on coupling a reinforcement learning application, the mountain
car domain, with an experimental design, and this section describes each component
of this experimental system. We use this work as a proof-of-concept in applying
design of experiments to understand the performance of reinforcement learning.
Consequently, we restrict this work to a relatively simple domain and a fundamental
model-free learning algorithm; however, these methods can be used to understand
other learning algorithms in additional domains.
B.2.1 Mountain Car Domain
The mountain car domain (Moore 1990) places a car in a valley, where the goal is to
get the car to drive out of the valley. The car's engine is not powerful enough for it to
drive out of the valley, and the car must instead build up momentum by successively
driving up opposing sides of the valley. The state of the car is defined by its position
x ∈ [−1.2, 0.5] and its velocity ẋ ∈ [−1.5, 1.5], and the goal is located at x = 0.5. At
the beginning of each episode, x is uniformly randomly sampled from [−1.2, 0.5]
and ẋ = 0. The dynamics of the car follow:
x_{t+1} = x_t + \Delta t \, \dot{x}_t

\dot{x}_{t+1} = \dot{x}_t + \Delta t \left( \frac{f a - 9.8 \, m \cos(3 x_t) - \mu \dot{x}_t}{m} \right)

where Δt = 0.01 is the time step, m = 0.02 is the car's mass, f = 0.2 is the engine
force, and μ = 0.5 is a friction coefficient. The variable a represents the action taken
by the agent, where a = −1 for driving backwards, a = 0 for neutral, and a = 1
for driving forwards. At every time step that the car has not reached the goal, the
agent receives a reward (i.e., penalty) r equal to x. When the car reaches the goal,
the agent receives a reward of 1, and the episode ends.
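To make the simulation concrete, the dynamics above can be implemented in a few lines of Python. The following sketch is only illustrative: the class and method names (MountainCar, reset, step), the clipping of position and velocity to the stated bounds, and the exact grouping of the force terms are assumptions rather than the implementation used in this study.

```python
import numpy as np

# Illustrative sketch of the mountain car dynamics described above.
# Names, the clipping to the stated bounds, and the grouping of the
# force terms are assumptions, not the authors' code.

DT = 0.01   # time step (Delta t)
M = 0.02    # car mass
F = 0.2     # engine force
MU = 0.5    # friction coefficient

X_MIN, X_MAX = -1.2, 0.5   # position bounds; goal at x = 0.5
V_MIN, V_MAX = -1.5, 1.5   # velocity bounds


class MountainCar:
    def __init__(self, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.reset()

    def reset(self):
        # Position sampled uniformly over [-1.2, 0.5]; velocity starts at 0.
        self.x = self.rng.uniform(X_MIN, X_MAX)
        self.v = 0.0
        return np.array([self.x, self.v])

    def step(self, a):
        """Apply action a in {-1, 0, 1}; return (state, reward, done)."""
        # Acceleration from the reconstructed update: engine force,
        # gravity along the slope, and friction, divided by the mass.
        accel = (F * a - 9.8 * M * np.cos(3.0 * self.x) - MU * self.v) / M
        # Update position with the old velocity, then the velocity itself.
        self.x = float(np.clip(self.x + DT * self.v, X_MIN, X_MAX))
        self.v = float(np.clip(self.v + DT * accel, V_MIN, V_MAX))
        done = self.x >= X_MAX
        # Reward: +1 on reaching the goal, otherwise the (usually negative)
        # position x acts as a per-step penalty.
        reward = 1.0 if done else self.x
        return np.array([self.x, self.v]), reward, done
```

An episode then consists of calling reset() once and calling step(a) repeatedly until done is returned as True.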
B.2.2 Agent Representation
A three-layer neural network is used to represent the agent and to learn the value
function V(s_t, a_t), which represents the value of pursuing action a_t while in state
s_t. The network has 2 inputs, corresponding to the state s_t = [x_t, ẋ_t]^T, 21 hidden
nodes, and 3 output nodes that represent the values of the three actions. The hidden
and output layers use tanh and linear transfer functions, respectively. Each time a
network is created, new weights are initialized by uniform random sampling over
[−0.1, 0.1]. The learning rates α are initialized by layer using a heuristic similar to
that described in Gatti and Embrechts (2012). Input-hidden (α_hi) and hidden-output
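For concreteness, the network just described can be sketched as a small feedforward model. The layer sizes, transfer functions, and weight-initialization range follow the text above; the presence of bias terms, the NumPy forward pass, and the names (ValueNetwork, values) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Sketch of the three-layer value-function network described above:
# 2 inputs (x_t, xdot_t), 21 tanh hidden nodes, 3 linear outputs (one value
# per action), with weights drawn uniformly from [-0.1, 0.1]. Bias terms
# are an assumption; the text does not specify them.

class ValueNetwork:
    def __init__(self, n_in=2, n_hidden=21, n_out=3, rng=None):
        rng = rng if rng is not None else np.random.default_rng()
        self.W_hi = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in))   # input-to-hidden
        self.b_h = rng.uniform(-0.1, 0.1, size=n_hidden)
        self.W_oh = rng.uniform(-0.1, 0.1, size=(n_out, n_hidden))  # hidden-to-output
        self.b_o = rng.uniform(-0.1, 0.1, size=n_out)

    def values(self, state):
        """Return estimated values of the three actions for a state [x, xdot]."""
        h = np.tanh(self.W_hi @ np.asarray(state) + self.b_h)  # tanh hidden layer
        return self.W_oh @ h + self.b_o                        # linear output layer
```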