[Fig. 6.3 Parameter ranges for the convergent subregions for the TBU problem. Panels show the ranges for ε (0.6–1), γ (0.9–1), λ (0–1), the α ratio (1–8), the α magnitude (0.00333–0.01), and the number of hidden nodes (11–61).]
through exploiting (possibly incorrect) knowledge is better than exploring actions, and this may be due to the fragility of the problem, where small errors can result in the truck jack-knifing.
The next-state discount parameter γ generally ranges from 0.96 to 0.99, and this is quite consistent with what is suggested and used in other implementations in the literature (Thrun and Schwartz 1993; Thrun 1995; Gatti et al. 2013). However, the same cannot be said for the temporal discount factor λ. We find that this parameter generally has to be quite low (0.30 and below), which contrasts with the many implementations of TD(λ) that suggest that λ be set to between 0.6 and 0.8 (Tesauro 1992; Patist and Wiering 2004; Wiering et al. 2007; Wiering 2010). Lower values of λ pass back little information from time step to time step, and thus learning is generally perceived to be slow.
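The effect of λ on how far credit propagates can be seen in a minimal sketch of tabular TD(λ) with accumulating eligibility traces (an illustrative simplification, not the network-based implementation used here; the state indices and reward values are hypothetical):

```python
import numpy as np

def td_lambda_update(V, trajectory, alpha=0.1, gamma=0.98, lam=0.3):
    """One backward-view pass over a trajectory of (state, reward, next_state)."""
    e = np.zeros_like(V)  # eligibility trace per state
    for s, r, s_next in trajectory:
        delta = r + gamma * V[s_next] - V[s]  # TD error at this step
        e[s] += 1.0                # mark the visited state as eligible
        V += alpha * delta * e     # credit all recently visited states
        e *= gamma * lam           # decay: small lambda keeps credit local
    return V

# A reward arrives only at the last transition; with lam = 0.3 the trace has
# decayed to (gamma * lam)^3 by then, so the first state receives far less
# credit than it would with lam = 0.8.
traj = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 1.0, 4)]
V_low = td_lambda_update(np.zeros(5), traj, lam=0.3)
V_high = td_lambda_update(np.zeros(5), traj, lam=0.8)
```

Here `V_high[0]` exceeds `V_low[0]` by roughly an order of magnitude, which is the sense in which low λ "passes back little information" per step.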
problem is simply selecting actions that avoid jack-knifing the truck. Much of what
is learned may therefore simply be what actions to perform and what actions to avoid
for any configuration of the truck regardless of the location of the truck. It is possible
that learning this mapping between the truck configuration and selecting actions to
avoid jack-knifing may be a rather static learning problem in that this mapping may