of backgammon (Tesauro 1990, 1994), and it is still used by practitioners (Dann et al. 2014). Despite its use, questions remain concerning why learning occurs in some problem domains while it is limited or absent in others. Furthermore, our understanding of this algorithm when paired with a neural network is even more limited. The inconsistencies in the success of the TD(λ) algorithm when used with a neural network, as well as both successful and unsuccessful personal experiences with these methods, were the motivation for this work.
8.1.1 Parameter Effects
The temporal discount factor λ essentially scales the influence of weight updates from previous time steps, and it therefore attenuates how far information (in terms of weight changes) is propagated through time. In general, the literature suggests that setting this parameter between 0.6 and 0.8 works well in many applications (Tesauro 1992; Patist and Wiering 2004; Wiering et al. 2007; Wiering 2010), and there is little work that explores setting it to other values (Sutton and Barto 1998). In the mountain car problem, we found that λ could take on values over [0, 1] across all convergent subregions, though each individual subregion had a somewhat smaller (but still large) range of acceptable values. In the TBU problem, λ had to be consistently on the lower end of [0, 1] for all convergent subregions. In the TTBU problem, although experimentation evaluated a smaller range of λ, this parameter had to lie within [0.5, 0.7] for the two convergent subregions. Our results are therefore both consistent and inconsistent with what is recommended in the literature.
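As a concrete illustration of the role λ plays, the sketch below shows a single TD(λ) update with accumulating eligibility traces. It is a minimal sketch only: it assumes a linear value estimate rather than the neural network used in this work, and the parameter names (alpha for the learning rate, phi for the feature vector) are illustrative, not the notation of this chapter.

import numpy as np

def td_lambda_step(w, e, phi_s, phi_next, reward, alpha, gamma, lam, terminal=False):
    """One TD(lambda) update with a linear value estimate V(s) = w . phi(s).

    The eligibility trace e is decayed by gamma * lam each step, so lam
    controls how far back in time the current TD error adjusts the weights.
    """
    v_s = float(w @ phi_s)
    v_next = 0.0 if terminal else float(w @ phi_next)
    delta = reward + gamma * v_next - v_s      # TD error
    e = gamma * lam * e + phi_s                # accumulating eligibility trace
    w = w + alpha * delta * e                  # recently visited states get more credit
    return w, e

# Example: four features, one update after receiving a reward of 1.0.
w = np.zeros(4); e = np.zeros(4)
w, e = td_lambda_step(w, e, np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]),
                      reward=1.0, alpha=0.1, gamma=0.9, lam=0.7)

With λ = 0 the trace reduces to the current feature vector and only the most recent state is updated; larger λ lets a single TD error reach further back through the sequence of visited states.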
The convergent parameter subregions for the action exploration-exploitation parameter were also inconsistent across the three problem domains. The mountain car domain had both small and large ranges for this parameter depending on the subregion, with some of the lower bounds of these subregions extending down to 0.6, which results in nearly as much action exploration as exploitation. In the TBU problem, the ranges for this parameter were also consistent and similar across the convergent subregions (similar to the mountain car problem), and the parameter was required to range from approximately 0.9 to 1.0 (across all subregions). Practically, this means that the majority of the time, knowledge about which actions to take was learned by trying actions that were perceived to be useful based on the agent's current knowledge, rather than by exploring other, non-knowledge-based actions. In the TTBU problem, this parameter could range over the entire parameter space of [0.85, 0.97] that was evaluated. The difference in the specificity of this parameter between the TBU and TTBU problems could be due to the fact that the goal criteria in the TBU problem were stricter, and due to the larger state and action space of the TTBU problem, which required more exploration of the actions.
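The description above treats this parameter as the probability of acting on current knowledge. A minimal sketch under that assumption is shown below; the function and argument names (exploit_prob, q_values) are hypothetical and the exploratory action is drawn uniformly at random, which may differ from the exploration scheme used in this work.

import numpy as np

def select_action(q_values, exploit_prob, rng):
    """With probability exploit_prob take the greedy action (exploit current
    knowledge); otherwise pick an action uniformly at random (explore)."""
    if rng.random() < exploit_prob:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

# Example: with exploit_prob = 0.95 the agent acts greedily roughly 95% of the time.
rng = np.random.default_rng(0)
action = select_action(np.array([0.1, 0.4, 0.2]), exploit_prob=0.95, rng=rng)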
The next-state discount factor γ attenuates the value of the next state in the TD(λ) algorithm. Relative to λ and the exploration-exploitation parameter, γ receives little attention in the literature, and some work even neglects to include it in the description of the TD(λ) algorithm (Tesauro 1992). Some work suggests that no learning will occur if γ = 1.
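Because the bootstrapped next-state value itself contains γ-discounted future values, the effect of γ compounds over steps: information k steps ahead is weighted by γ^k, and γ = 1 applies no attenuation at all. A small numeric illustration, using a hypothetical four-step reward sequence:

# Each reward k steps ahead is weighted by gamma**k, so gamma sets how
# strongly future values feed back into the current estimate.
rewards = [1.0, 1.0, 1.0, 1.0]
for gamma in (0.0, 0.9, 1.0):
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    print(gamma, discounted)   # -> 1.0, 3.439, 4.0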