trade-off for action selection. Action exploration has been found to be very effective in improving both the speed and the quality of learning, and it is a standard action selection policy during training. However, the purely exploitative action selection procedure used by Tesauro (1995) in backgammon also resulted in excellent performance, which may have been due to the stochasticity of the game itself allowing for implicit exploration of the state space (Kaelbling et al. 1996).
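To make the trade-off concrete, the sketch below shows an epsilon-greedy action selection rule, one common way of balancing exploration against exploitation. It is a minimal illustration in Python, assuming a tabular value estimate Q indexed by (state, action) pairs and a small discrete action set; the function name, the default epsilon, and the dictionary representation are choices made for this example, not details taken from the works cited above.

    import random

    def select_action(Q, state, actions, epsilon=0.1):
        # With probability epsilon, explore: try a uniformly random action.
        if random.random() < epsilon:
            return random.choice(actions)
        # Otherwise exploit: take the action with the highest estimated value.
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

Setting epsilon to zero recovers a purely exploitative policy of the kind used by Tesauro (1995), in which any effective exploration must come from the stochasticity of the environment itself.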
Training for longer durations, or playing more games, has produced mixed results. Tesauro (1995) trained a neural network for months (Kaelbling et al. 1996), which resulted in world-class performance. However, Wiering et al. (2007) found that longer training did not result in marked improvements in playing performance for their implementation of backgammon. Similarly, Gatti et al. (2011a) found that training for 100,000 games did not result in better performance than training for 10,000 games. The reason why the backgammon implementation of Tesauro (1995) kept improving is unknown and somewhat puzzling, because temporal difference algorithms are known to 'unlearn'; that is, their performance does not necessarily increase monotonically with more training. Experience replay stores previously played games and trains on them again, and this method has been found to be effective in increasing performance in keep-away (Kalyanakrishnan and Stone 2007), though it did not increase maximal performance in the works by Lin (1992) for a Gridworld-type problem and by van Seijen et al. (2011) for the mountain car problem.
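As a rough illustration of the experience replay idea, the following sketch stores past transitions in a fixed-capacity buffer and draws random batches from it for repeated training updates. It assumes transitions are represented as simple tuples; the class name, the default capacity, and the uniform sampling scheme are assumptions made for this example rather than details of any of the implementations cited above.

    import random
    from collections import deque

    class ReplayBuffer:
        # Holds previously observed transitions so the learner can
        # revisit them in later training updates.
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Draw a random batch of stored transitions for a training update.
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))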
The training opponent has also been found to have an impact on playing performance in adversarial domains. Tesauro (1995) used a self-play training scheme that eventually produced an excellent player, and this approach has been used successfully in many other games (Wiering et al. 2007; Wiering 2010; Gatti et al. 2011b). As mentioned, an alternative training strategy uses database games, where the agent observes games that were previously played by high-level opponents. While this seems beneficial, this method yielded the worst performance among a number of different training methods in an implementation of backgammon by Wiering (2010), and it is speculated that this is because the agent cannot test and explore actions that may potentially be better. In similar work, Wiering et al. (2007) found that performance differed very little after training against an expert player versus using self-play. Schraudolph et al. (1994) used three openly available computer programs with differing levels of expertise to train individual neural networks to play Go, and found that the final playing ability of each network was quite different.
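The essence of a self-play scheme can be sketched as a short training loop in which the same learning policy controls both sides of every game, so the training opponent improves along with the agent. The sketch below is a hypothetical skeleton, not the procedure used in any of the systems cited above; play_game, td_update, and the default number of games are placeholder names and values supplied only for illustration.

    def train_by_self_play(policy, play_game, td_update, num_games=10000):
        # play_game(p1, p2) is assumed to return the trajectory of one game;
        # td_update(policy, trajectory) applies a learning update from it.
        for _ in range(num_games):
            trajectory = play_game(policy, policy)  # the policy plays itself
            td_update(policy, trajectory)           # learn from the finished game
        return policy

Training against a fixed expert would replace the second argument of play_game with a fixed opponent, and training on database games would replace the game-playing step with stored games, which connects this loop to the alternative strategies discussed above.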
Generalized domains have also been used to gain an understanding of the behavior of algorithms under different domain characteristics. Bhatnagar et al. (2009) used two different Garnet problems to assess the performance of actor-critic algorithms in domains with different numbers of states, actions, and branching factors; a rough sketch of such a construction is given below. This work found that it was easier to find parameter settings for one particular actor-critic algorithm than for the others, and that there were considerable differences in the convergence speed of the algorithms tested.
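The sketch below builds a small randomly generated MDP in the spirit of a Garnet problem, parameterized by the number of states, the number of actions, and the branching factor (the number of successor states reachable from each state-action pair). The exact construction and reward structure used by Bhatnagar et al. (2009) may differ; the function name, the Dirichlet-distributed transition probabilities, and the Gaussian rewards are assumptions made for this illustration.

    import numpy as np

    def make_garnet(n_states, n_actions, branching, seed=0):
        # Randomly constructed MDP: each (state, action) pair reaches
        # 'branching' successor states with random transition probabilities.
        rng = np.random.default_rng(seed)
        P = np.zeros((n_states, n_actions, n_states))   # transition probabilities
        R = rng.normal(size=(n_states, n_actions))      # random expected rewards
        for s in range(n_states):
            for a in range(n_actions):
                successors = rng.choice(n_states, size=branching, replace=False)
                P[s, a, successors] = rng.dirichlet(np.ones(branching))
        return P, R

Varying the number of states, the number of actions, and the branching factor yields families of domains whose characteristics can be controlled systematically, which is what makes such generalized problems useful for comparing algorithms.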
The authors note that their parameter study was small and simplistic, and that its results are merely suggestive of parameter and domain effects. Kalyanakrishnan and Stone (2009, 2011) compare the performance