Table 7.2 Variables and their associated ranges used in sequential CART for the TTBU problem.

Variable                   Description                               Range            RL component
α_mag                      Base (input-hidden layer) learning rate   [0.0001, 0.01]   Neural network
α_ratio                    Learning rate ratio                       [2.0, 5.0]       Neural network
λ                          Temporal discount factor                  [0.4, 0.7]       TD(λ) algorithm
γ                          Next-state discount factor                [0.96, 0.99]     TD(λ) algorithm
P (action exploitation)    —                                         [0.85, 0.97]     TD(λ) algorithm
and based on preliminary testing. We do not claim that 51 nodes is an ideal number of nodes in the hidden layer, and experimentation including this as a variable could be performed, but removing this variable reduces the complexity of the experimentation. The input layer had five nodes (for the five state variables), and the output layer had nine nodes (for the nine possible actions). Prior to passing the state into the neural network, the state variables x and y were scaled to [−0.1, 0.1] (relative to the domain bounds of x, y ∈ [−100, 100]) so that they were on a similar scale as the truck angles θ₀, θ₂, and θ₄.
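The scaling described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `scale_state` and the state ordering (x, y first, then the three angles) are assumptions for the example.

```python
import numpy as np

def scale_state(state):
    """Scale the x and y state variables (first two entries) from the
    domain bounds [-100, 100] down to [-0.1, 0.1], leaving the truck
    angles unchanged, before the state is passed to the neural network."""
    scaled = np.asarray(state, dtype=float).copy()
    scaled[:2] /= 1000.0  # maps [-100, 100] onto [-0.1, 0.1]
    return scaled

# Example: a state at the corner of the position domain.
s = scale_state([100.0, -100.0, 0.3, -0.1, 0.2])
```

Dividing by 1000 is the linear map implied by the two ranges; any equivalent affine rescaling would serve the same purpose of keeping all five inputs on a similar scale.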
7.2 Sequential CART
We investigated convergence in the TTBU domain for five parameters of the neural network and of the TD(λ) algorithm. The parameters and their initial ranges are shown in Table 7.2. These ranges are slightly smaller than those used in the previous problems, in order to focus the experiment given the longer simulation times of the TTBU domain. The ranges were chosen based on prior experience with these methods, with little excess range on either end of the commonly used parameter values, yet they still define a rather large parameter space.
The parameters used in the sequential CART modeling are shown in Table 7.3. Each design point evaluated using sequential CART consisted of having the agent attempt to learn the TTBU domain in 10,000 episodes, where an episode is one attempt at backing the tandem trailer truck to the goal location. The initial experimental design therefore consisted of 125 unique design points, with 3 replicates each, totaling 375 initial runs, and was generated using Latin hypercube sampling (LHS). Subsequent designs for each iteration of the sequential CART algorithm were also LHS designs, consisting of 25 design points, again with 3 replicates each, for a total of 75 runs. Only three iterations of sequential CART were used because we allowed fewer design points to fall into each leaf. When fewer design points are allowed in each leaf, the CART model will often have more leaves, which results in more parameter subregions to explore in subsequent iterations, thus increasing the computation time.
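The initial LHS design over the Table 7.2 ranges can be sketched as below. This is an illustrative reconstruction, not the authors' tooling; it assumes SciPy's `scipy.stats.qmc` module as one readily available LHS implementation.

```python
import numpy as np
from scipy.stats import qmc

# Lower and upper bounds of the five parameters from Table 7.2:
# alpha_mag, alpha_ratio, lambda, gamma, P(action exploitation).
lower = [0.0001, 2.0, 0.4, 0.96, 0.85]
upper = [0.01,   5.0, 0.7, 0.99, 0.97]

# 125 unique design points in the unit hypercube, mapped to the ranges.
sampler = qmc.LatinHypercube(d=5, seed=0)
design = qmc.scale(sampler.random(n=125), lower, upper)

# Each design point is replicated 3 times, giving the 375 initial runs.
runs = np.repeat(design, 3, axis=0)
```

Each subsequent sequential-CART iteration would generate a smaller LHS design of the same form (25 points, 3 replicates) restricted to the parameter subregions identified by the CART leaves.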