[Fig. 6.3 Parameter ranges for the convergent subregions for the TBU problem. Panels show the ranges for ε (0.6–1), γ (0.9–1), λ (0–1), the α ratio (1–8), the α magnitude (0.00333–0.01), and the number of hidden nodes (11–61).]
through exploiting (possibly incorrect) knowledge is better than exploring actions, and this may be due to the fragility of the problem, where small errors can result in the truck jack-knifing.
The next-state discount parameter γ generally ranges from 0.96 to 0.99, and this is quite consistent with what is suggested and used in other implementations in the literature (Thrun and Schwartz 1993; Thrun 1995; Gatti et al. 2013). However, the same cannot be said for the temporal discount factor λ. We find that this parameter generally has to be quite low (0.30 and below), which contrasts with the many implementations of TD(λ) that suggest that λ be set to between 0.6 and 0.8 (Tesauro 1992; Patist and Wiering 2004; Wiering et al. 2007; Wiering 2010). Lower values of λ pass back little information from time step to time step, and thus learning is generally perceived to be slow.
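The effect of λ on how far credit propagates can be seen in a minimal sketch of tabular TD(λ) with accumulating eligibility traces (an illustrative simplification, not the network-based implementation used here; the state indices and reward values are hypothetical):

```python
import numpy as np

def td_lambda_update(V, trajectory, alpha=0.1, gamma=0.98, lam=0.3):
    """One backward-view pass over a trajectory of (state, reward, next_state)."""
    e = np.zeros_like(V)  # eligibility trace per state
    for s, r, s_next in trajectory:
        delta = r + gamma * V[s_next] - V[s]  # TD error at this step
        e[s] += 1.0                # mark the visited state as eligible
        V += alpha * delta * e     # credit all recently visited states
        e *= gamma * lam           # decay: small lambda keeps credit local
    return V

# A reward arrives only at the last transition; with lam = 0.3 the trace has
# decayed to (gamma * lam)^3 by then, so the first state receives far less
# credit than it would with lam = 0.8.
traj = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 1.0, 4)]
V_low = td_lambda_update(np.zeros(5), traj, lam=0.3)
V_high = td_lambda_update(np.zeros(5), traj, lam=0.8)
```

Here `V_high[0]` exceeds `V_low[0]` by roughly an order of magnitude, which is the sense in which low λ "passes back little information" per step.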
problem is simply selecting actions that avoid jack-knifing the truck. Much of what
is learned may therefore simply be what actions to perform and what actions to avoid
for any configuration of the truck regardless of the location of the truck. It is possible
that learning this mapping between the truck configuration and selecting actions to
avoid jack-knifing may be a rather static learning problem in that this mapping may