of backgammon (Tesauro 1990, 1994), and it is still used by practitioners (Dann et al. 2014). Despite its use, questions remain concerning why learning occurs in some problem domains while it is limited or absent in others. Furthermore, our understanding of this algorithm when paired with a neural network is even more limited. The inconsistencies in the success of the TD(λ) algorithm when used with a neural network, as well as both successful and unsuccessful personal experiences with these methods, were the motivation for this work.
8.1.1 Parameter Effects
The temporal discount factor λ essentially scales the influence of weight updates from previous time steps, and it therefore attenuates how far information (in terms of weight changes) is propagated through time. In general, the literature suggests that setting this parameter between 0.6 and 0.8 works well in many applications (Tesauro 1992; Patist and Wiering 2004; Wiering et al. 2007; Wiering 2010), and there is little work that explores setting it to other values (Sutton and Barto 1998). In the mountain car problem, we found that λ could take on values over [0, 1] across all convergent subregions, though each individual subregion had a somewhat smaller (but still large) range of acceptable values. In the TBU problem, λ had to be consistently on the lower end of [0, 1] for all convergent subregions. In the TTBU problem, although experimentation evaluated a smaller range of λ, this parameter had to lie within [0.5, 0.7] for the two convergent subregions. Our results are therefore both consistent and inconsistent with what is recommended in the literature.
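As a concrete illustration of the role λ plays, the sketch below shows a single TD(λ) update with accumulating eligibility traces. It is a minimal sketch only: it assumes a linear value estimate rather than the neural network used in this work, and the parameter names (alpha for the learning rate, phi for the feature vector) are illustrative, not the notation of this chapter.

import numpy as np

def td_lambda_step(w, e, phi_s, phi_next, reward, alpha, gamma, lam, terminal=False):
    """One TD(lambda) update with a linear value estimate V(s) = w . phi(s).

    The eligibility trace e is decayed by gamma * lam each step, so lam
    controls how far back in time the current TD error adjusts the weights.
    """
    v_s = float(w @ phi_s)
    v_next = 0.0 if terminal else float(w @ phi_next)
    delta = reward + gamma * v_next - v_s      # TD error
    e = gamma * lam * e + phi_s                # accumulating eligibility trace
    w = w + alpha * delta * e                  # recently visited states get more credit
    return w, e

# Example: four features, one update after receiving a reward of 1.0.
w = np.zeros(4); e = np.zeros(4)
w, e = td_lambda_step(w, e, np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]),
                      reward=1.0, alpha=0.1, gamma=0.9, lam=0.7)

With λ = 0 the trace reduces to the current feature vector and only the most recent state is updated; larger λ lets a single TD error reach further back through the sequence of visited states.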
The convergent parameter subregions for the action exploration-exploitation parameter were also inconsistent across the three problem domains. The mountain car domain had both small and large ranges for this parameter depending on the subregion, with some of the lower bounds of these subregions extending down to 0.6, which results in nearly as much action exploration as exploitation. In the TBU problem, the ranges for this parameter were also consistent and similar across the convergent subregions (similar to the mountain car problem), and the parameter was required to range from approximately 0.9 to 1.0 (across all subregions). Practically, this means that the majority of the time, knowledge about which actions to take was learned by trying actions that were perceived to be useful based on the agent's current knowledge, rather than by exploring other, non-knowledge-based actions. In the TTBU problem, this parameter could range over the entire parameter space of [0.85, 0.97] that was evaluated. The difference in the specificity of this parameter between the TBU and TTBU problems could be due to the fact that the goal criteria in the TBU problem were stricter, and due to the larger state and action space of the TTBU problem, which required more exploration of the actions.
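The description above treats this parameter as the probability of acting on current knowledge. A minimal sketch under that assumption is shown below; the function and argument names (exploit_prob, q_values) are hypothetical and the exploratory action is drawn uniformly at random, which may differ from the exploration scheme used in this work.

import numpy as np

def select_action(q_values, exploit_prob, rng):
    """With probability exploit_prob take the greedy action (exploit current
    knowledge); otherwise pick an action uniformly at random (explore)."""
    if rng.random() < exploit_prob:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

# Example: with exploit_prob = 0.95 the agent acts greedily roughly 95% of the time.
rng = np.random.default_rng(0)
action = select_action(np.array([0.1, 0.4, 0.2]), exploit_prob=0.95, rng=rng)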
The next-state discount factor γ attenuates the value of the next state in the TD(λ) algorithm. Relative to λ and the exploration-exploitation parameter, γ receives little attention in the literature, and some work even neglects to include it in the description of the TD(λ) algorithm (Tesauro 1992). Some work suggests that no learning will occur if γ = 1.
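Because the bootstrapped next-state value itself contains γ-discounted future values, the effect of γ compounds over steps: information k steps ahead is weighted by γ^k, and γ = 1 applies no attenuation at all. A small numeric illustration, using a hypothetical four-step reward sequence:

# Each reward k steps ahead is weighted by gamma**k, so gamma sets how
# strongly future values feed back into the current estimate.
rewards = [1.0, 1.0, 1.0, 1.0]
for gamma in (0.0, 0.9, 1.0):
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    print(gamma, discounted)   # -> 1.0, 3.439, 4.0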