trade-off for action selection. Action exploration has been found to be very effective in improving both the speed and the quality of learning, and it is a standard action selection policy during training. However, the purely exploitative action selection procedure used by Tesauro (1995) in backgammon also resulted in excellent performance, which may have been due to the stochasticity of the game itself allowing for implicit exploration of the state space (Kaelbling et al. 1996).
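To make the trade-off concrete, the sketch below shows an epsilon-greedy action selection rule, one common way of balancing exploration against exploitation. It is a minimal illustration in Python, assuming a tabular value estimate Q indexed by (state, action) pairs and a small discrete action set; the function name, the default epsilon, and the dictionary representation are choices made for this example, not details taken from the works cited above.

    import random

    def select_action(Q, state, actions, epsilon=0.1):
        # With probability epsilon, explore: try a uniformly random action.
        if random.random() < epsilon:
            return random.choice(actions)
        # Otherwise exploit: take the action with the highest estimated value.
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

Setting epsilon to zero recovers a purely exploitative policy of the kind used by Tesauro (1995), in which any effective exploration must come from the stochasticity of the environment itself.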
Training for longer durations, or playing more games, has produced mixed results. Tesauro (1995) trained a neural network for months (Kaelbling et al. 1996), which resulted in world-class performance. However, Wiering et al. (2007) found that longer training did not result in marked improvements in playing performance for their implementation of backgammon. Similarly, Gatti et al. (2011a) found that training for 100,000 games did not result in better performance than training for 10,000 games. The reason why the backgammon implementation of Tesauro (1995) kept improving is unknown and somewhat puzzling, because temporal difference algorithms are known to 'unlearn'; that is, their performance does not necessarily increase monotonically with more training. Experience replay stores previously played games and trains on them again, and this method has been found to be effective in increasing performance in keep-away (Kalyanakrishnan and Stone 2007), though it did not increase maximal performance in the works by Lin (1992) for a Gridworld-type problem and by van Seijen et al. (2011) for the mountain car problem.
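As a rough illustration of the experience replay idea, the following sketch stores past transitions in a fixed-capacity buffer and draws random batches from it for repeated training updates. It assumes transitions are represented as simple tuples; the class name, the default capacity, and the uniform sampling scheme are assumptions made for this example rather than details of any of the implementations cited above.

    import random
    from collections import deque

    class ReplayBuffer:
        # Holds previously observed transitions so the learner can
        # revisit them in later training updates.
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Draw a random batch of stored transitions for a training update.
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))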
The training opponent has also been found to have an impact on playing performance in adversarial domains. Tesauro (1995) used a self-play training scheme that eventually produced an excellent player, and this approach has been used successfully in many other games (Wiering et al. 2007; Wiering 2010; Gatti et al. 2011b). As mentioned, an alternative training strategy uses database games, where the agent observes games that were previously played by high-level opponents. While this seems beneficial, this method yielded the worst performance among a number of different training methods in an implementation of backgammon by Wiering (2010), and it is speculated that this is because the agent cannot test and explore actions that may potentially be better. In similar work, Wiering et al. (2007) found that performance differed very little after training against an expert player versus using self-play. Schraudolph et al. (1994) used three openly available computer programs with differing levels of expertise to train individual neural networks to play Go, and found that the final playing ability of each network was quite different.
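The essence of a self-play scheme can be sketched as a short training loop in which the same learning policy controls both sides of every game, so the training opponent improves along with the agent. The sketch below is a hypothetical skeleton, not the procedure used in any of the systems cited above; play_game, td_update, and the default number of games are placeholder names and values supplied only for illustration.

    def train_by_self_play(policy, play_game, td_update, num_games=10000):
        # play_game(p1, p2) is assumed to return the trajectory of one game;
        # td_update(policy, trajectory) applies a learning update from it.
        for _ in range(num_games):
            trajectory = play_game(policy, policy)  # the policy plays itself
            td_update(policy, trajectory)           # learn from the finished game
        return policy

Training against a fixed expert would replace the second argument of play_game with a fixed opponent, and training on database games would replace the game-playing step with stored games, which connects this loop to the alternative strategies discussed above.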
Generalized domains have also been used to gain an understanding of the behavior of algorithms under different domain characteristics. Bhatnagar et al. (2009) used two different Garnet problems to assess the performance of actor-critic algorithms in domains with different numbers of states, actions, and branching factors; a rough sketch of such a construction is given below. This work found that it was easier to find parameter settings for one particular actor-critic algorithm than for the others, and that there were considerable differences in the convergence speed of the algorithms tested.
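The sketch below builds a small randomly generated MDP in the spirit of a Garnet problem, parameterized by the number of states, the number of actions, and the branching factor (the number of successor states reachable from each state-action pair). The exact construction and reward structure used by Bhatnagar et al. (2009) may differ; the function name, the Dirichlet-distributed transition probabilities, and the Gaussian rewards are assumptions made for this illustration.

    import numpy as np

    def make_garnet(n_states, n_actions, branching, seed=0):
        # Randomly constructed MDP: each (state, action) pair reaches
        # 'branching' successor states with random transition probabilities.
        rng = np.random.default_rng(seed)
        P = np.zeros((n_states, n_actions, n_states))   # transition probabilities
        R = rng.normal(size=(n_states, n_actions))      # random expected rewards
        for s in range(n_states):
            for a in range(n_actions):
                successors = rng.choice(n_states, size=branching, replace=False)
                P[s, a, successors] = rng.dirichlet(np.ones(branching))
        return P, R

Varying the number of states, the number of actions, and the branching factor yields families of domains whose characteristics can be controlled systematically, which is what makes such generalized problems useful for comparing algorithms.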
The authors note that their parameter study was small and simplistic, and that its results are merely suggestive of parameter and domain effects. Kalyanakrishnan and Stone (2009, 2011) compare the performance