The Truck Backer-upper Problem - Design of Experiments for Reinforcement Learning - page 113

[Figure 6.2: tree diagram. Levels: Seed run, Iter 1, Iter 2, Iter 3, Iter 4. Labeled nodes: 12, 25, 31, 32, 35, 63, 70.]

Fig. 6.2 Sequential CART process for the TBU problem. Parameter subregions that are candidates for further experimentation are shown as black circles, subregions that are pruned from further experimentation are shown as open circles, and convergent subregions are shown as blue squares.

sequential CART algorithm were also LHS designs, consisting of 30 design points, again with 3 replicates each, for a total of 90 runs.
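The run count above (30 design points with 3 replicates each, giving 90 runs) can be illustrated with a minimal Latin hypercube sketch. This is not the authors' implementation: the function names `latin_hypercube` and `replicated_runs`, the two-parameter bounds, and the seed are all assumptions made for illustration, and only the Python standard library is used.

```python
import random


def latin_hypercube(n_points, bounds, rng):
    """Return n_points samples in which each parameter range is split into
    n_points equal strata and every stratum is sampled exactly once."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_points
        # One uniform draw inside each stratum, then shuffle the stratum order
        # so that the pairing across parameters is random.
        column = [low + (i + rng.random()) * width for i in range(n_points)]
        rng.shuffle(column)
        columns.append(column)
    # Transpose the per-parameter columns into a list of design points.
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def replicated_runs(design, n_replicates):
    """Pair every design point with a replicate index."""
    return [(point, rep) for point in design for rep in range(n_replicates)]


# Hypothetical two-parameter subregion; these bounds are placeholders only.
bounds = [(0.0, 1.0), (0.0, 1.0)]
design = latin_hypercube(30, bounds, random.Random(0))
runs = replicated_runs(design, 3)
print(len(design), len(runs))  # 30 90
```

In practice each replicate of a design point would typically use a different random seed for the learning run, so the three replicates give an estimate of run-to-run variability at that point.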

6.2.1 Convergent Subregions

Figure 6.2 shows the sequential CART process for the TBU problem. Following the sequential CART procedure, additional screening of the leaf nodes required that each subregion contain at least 12 design points and have a dimensionality ratio less than or equal to 20. Figure 6.3 shows the parameter ranges for each of the convergent subregions, and Table C.3 in the Appendix provides their numerical values.
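The two screening thresholds stated above can be expressed as a simple filter. The sketch below is illustrative only: the `LeafNode` record, its field names, and the example leaves are hypothetical, and the dimensionality ratio is treated as a value computed elsewhere because its definition is not given in this section.

```python
from dataclasses import dataclass


@dataclass
class LeafNode:
    # Hypothetical record for a CART leaf; the field names are illustrative.
    node_id: int
    n_design_points: int
    dimensionality_ratio: float  # assumed to be computed elsewhere


MIN_DESIGN_POINTS = 12         # threshold stated in the text
MAX_DIMENSIONALITY_RATIO = 20  # threshold stated in the text


def passes_screening(leaf):
    """Return True when a leaf satisfies both screening criteria."""
    return (leaf.n_design_points >= MIN_DESIGN_POINTS
            and leaf.dimensionality_ratio <= MAX_DIMENSIONALITY_RATIO)


# Placeholder leaves, not values from the study.
leaves = [LeafNode(1, 15, 8.0), LeafNode(2, 9, 5.0), LeafNode(3, 20, 35.0)]
kept = [leaf.node_id for leaf in leaves if passes_screening(leaf)]
print(kept)  # [1]
```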

The convergent subregions have some interesting characteristics. For nearly all of the parameters, the range within a convergent subregion is a small subset of the corresponding range in the full parameter hypercube. For the neural network, there is a minimum required size of between 26 and 40 hidden nodes, suggesting that a smaller network is unlikely to be able to learn this problem. The magnitude of the learning rates needs to be closer to 0.01, though the ratio of the learning rates between the layers of the network has no consistent range across convergent subregions. In some subregions this ratio matters little or not at all (e.g., subregions 12, 31, 32, 35, and 70), whereas in others it is restricted to a small range (e.g., subregions 25 and 63).

There seem to be very specific parameter subregions for the parameters of the TD(λ) learning algorithm. The action selection exploration/exploitation trade-off parameter needs to be set high, to at least about 0.89, and its range extends all the way to 1.00 in some subregions (e.g., 12, 35, and 63). This suggests that exploiting actions the vast majority of the time is most beneficial. That is, learning
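The effect of a high exploitation setting can be sketched with a simple action-selection rule. This sketch assumes the trade-off parameter is the probability of choosing the highest-valued action, with a uniformly random action otherwise; the section does not define the parameter precisely, so that interpretation, the function name `select_action`, and the placeholder action values are all assumptions.

```python
import random


def select_action(action_values, exploit_prob, rng):
    """With probability exploit_prob pick the highest-valued action;
    otherwise pick an action uniformly at random (assumed exploration rule)."""
    if rng.random() < exploit_prob:
        return max(range(len(action_values)), key=lambda a: action_values[a])
    return rng.randrange(len(action_values))


rng = random.Random(0)
values = [0.1, 0.7, 0.3]  # placeholder action-value estimates
picks = [select_action(values, 0.89, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(round(greedy_share, 2))
```

Because a random exploratory draw can also land on the greedy action, the observed share of greedy choices is somewhat higher than `exploit_prob` itself. Setting `exploit_prob` to 1.00 makes the rule purely greedy.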
